- Some theoretical aspects

- Implementing a Markovian model

- Examples

- Command-line options and additional tools

The

Main definition

Applied to genomic sequences, the random variable stands for the base in the sequence. The Markovian model constrains the occurrence probability for a base in a given context composed of the previously assigned letters, therefore weakly

What about heterogeneity?

Formally, a heterogeneous model allows the probabilities associated with a variable to depend on any of the variables that precede it. Our subset of the heterogeneous Markovian models will use an integer parameter called to compute the probabilities for . The phase of a variable is simply given by the formula . In our subset of the heterogeneous Markovian models, the probabilities for a variable depend on both its context and its phase. Such an addition is useful to model phenomena in which variables are gathered in packs of consecutive variables, and their values depend not only on their contexts but also on their relative position inside the pack. Protein-coding DNA sequences are one such case, since the position of a base within a base triplet is well known to be of interest.
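The dependence on both context and phase can be illustrated with a minimal Python sketch. It is not GenRGenS code: the function names, the layout of the `probs` table, and the phase convention (0-based position modulo the number of phases) are all assumptions made for the illustration.

```python
import random


def phase_of(position, n_phases):
    # Assumption: the phase is the 0-based position modulo the number of phases.
    return position % n_phases


def generate(length, order, n_phases, probs, start="", rng=None):
    """Draw a sequence where each symbol depends on its context and its phase.

    probs maps (context, phase) to a dict {symbol: probability}.
    start must contain at least `order` symbols.
    """
    rng = rng or random.Random()
    seq = list(start)
    while len(seq) < length:
        # The context is made of the `order` most recently emitted symbols.
        context = "".join(seq[-order:]) if order > 0 else ""
        table = probs[(context, phase_of(len(seq), n_phases))]
        symbols = list(table)
        weights = [table[s] for s in symbols]
        seq.append(rng.choices(symbols, weights=weights)[0])
    return "".join(seq)
```

With an order 0 table that forces one symbol per phase, for example `{("", 0): {"a": 1.0}, ("", 1): {"c": 1.0}, ("", 2): {"g": 1.0}}`, the generated sequence simply cycles through the three phases.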

It is also possible to simulate such phase alternation by introducing dummy letters encoding the phase in which they will be produced. For instance, an order 0 model having 3 phases (for coding DNA...) would result in a sequence model over a 12-letter alphabet, with one letter for each pair of base and phase, having non-null transition probabilities only from a letter of one phase to a letter of the next phase. Therefore, the expressivity of our heterogeneous models does not exceed that of the homogeneous ones, but it is much more convenient to write these models using a heterogeneous syntax.
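The construction can be sketched in a few lines of Python. This is an illustration only: the dummy letters are represented as hypothetical `(base, phase)` tuples, and the input layout `phased[k][x]` (probability of emitting `x` in phase `k`) is an assumption, not the GenRGenS description format.

```python
def to_homogeneous(phased):
    """Turn an order 0 model with several phases into an order 1 homogeneous model.

    phased[k][x] is the probability of emitting symbol x in phase k.
    The result maps each dummy letter (x, k) to the distribution of its successor.
    """
    n_phases = len(phased)
    transitions = {}
    for k in range(n_phases):
        next_phase = (k + 1) % n_phases
        for x in phased[k]:
            # A letter of phase k can only be followed by a letter of the next phase.
            transitions[(x, k)] = {
                (y, next_phase): p for y, p in phased[next_phase].items()
            }
    return transitions
```

For four bases and three phases the resulting model has 12 dummy letters, and every row of the transition table only mentions letters of the following phase.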

Hidden Markovian Models (HMMs)

A hidden Markovian model is a combination of a top-level Markovian model and a set of bottom-level Markovian models, called hidden states. The generation process associated with an HMM initiates the sequence using a random hidden state. At each step of the generation, the algorithm may switch to another hidden state using probabilities from the top-level model, and then emits a symbol using probabilities related to the current hidden state.

Once again, the expressivity of this class of models seems to exceed that of the classical Markovian models. However, in our context, it is possible to emulate a hidden model with a classical one just by duplicating the alphabet so that the emitted character also contains the state to which it belongs.
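A minimal sketch of this alphabet duplication, under simplifying assumptions that are not taken from GenRGenS: the top-level model is of order 1, each hidden state emits with an order 0 model, and the duplicated letters are hypothetical `(state, symbol)` tuples.

```python
def hmm_to_markov(switch, emit):
    """Emulate a hidden model with a classical order 1 model.

    switch[s][s2] is the probability of moving from hidden state s to s2.
    emit[s][x] is the probability that hidden state s emits symbol x.
    The result is a transition table over duplicated letters (state, symbol).
    """
    chain = {}
    for s in switch:
        for x in emit[s]:
            # First switch from s to s2, then emit y from s2.
            chain[(s, x)] = {
                (s2, y): p_switch * p_emit
                for s2, p_switch in switch[s].items()
                for y, p_emit in emit[s2].items()
            }
    return chain
```

Dropping the state component of each generated letter gives back a sequence over the original alphabet.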

Sets the

Sets the

Chooses the type of symbols to be used for random generation.

When

A

Defines the frequencies associated with various eligible prefixes for the generated sequence.

Each is either a sequence of symbols separated by white spaces or a word, depending on the value of the

If omitted, the beginning of the sequence is chosen according to the distribution of

Defines the probabilities of emission for the different symbols.

Each is either a sequence of symbols separated by white spaces or a word, depending on the value of the

Defines the hidden Markovian model's probabilities of emission.

First, a master model is defined for the alternation of the hidden states. , , are sequences of hidden states , separated or not by whitespaces depending on the previously defined value for the clause

Then, a model definition is required for each hidden state through the following syntax:

Once such a model is defined, a sequence is issued starting from a random hidden state . At each step of the generation, the process is allowed to move from the current hidden state to another hidden state (potentially ) and then emits a symbol according to the probabilities of the hidden state . The process is iterated until the expected number of symbols has been generated.
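The generation loop described above can be sketched as follows. This is a simplified illustration rather than the actual generator: the names are hypothetical, the master model is assumed to be of order 1, each hidden state is assumed to emit with an order 0 model, and the initial hidden state is drawn uniformly.

```python
import random


def generate_hmm(length, switch, emit, rng=None):
    """Generate `length` symbols from a toy hidden Markovian model.

    switch[s][s2] is the probability of moving from hidden state s to s2.
    emit[s][x] is the probability that hidden state s emits symbol x.
    """
    rng = rng or random.Random()
    state = rng.choice(list(switch))  # random initial hidden state
    symbols = []
    while len(symbols) < length:
        # Move to the next hidden state, which may be the current one.
        moves = switch[state]
        state = rng.choices(list(moves), weights=list(moves.values()))[0]
        # Emit a symbol according to the probabilities of the new state.
        table = emit[state]
        symbols.append(rng.choices(list(table), weights=list(table.values()))[0])
    return symbols
```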

In

In

In

Clause

In

In

In

Clause

In

In

In

Clause

Clause

- First, a random start `atg` is chosen with probability .
- The context, here composed of the two most recently emitted bases, is now `tg`, and the phase of the next base is . The emission probability of a `g` is . Similarly, probabilities of emissions for the other bases are , and .
- After a call to a random number generator, a `g` base is chosen and emitted.
- The new context is then `gg`, and the new phase is 0. We then consider new probabilities for the bases `a`, `c`, `g` and `t`:
  - **gg** a **25** 22 17
  - **gg** c **15** 8 21
  - **gg** g **14** 12 15
  - **gg** t **17** 26 16
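The last step can be made concrete with a short Python sketch. It assumes that the three numeric columns of the `gg` rows are frequencies for phases 0, 1 and 2, in that order, and that emission probabilities are obtained by normalizing the frequencies of the current phase; both points are assumptions of this illustration, and the function names are hypothetical.

```python
import random

# Frequencies for the context gg, copied from the rows above.
# Assumption: the columns correspond to phases 0, 1 and 2.
GG_FREQUENCIES = {
    "a": (25, 22, 17),
    "c": (15, 8, 21),
    "g": (14, 12, 15),
    "t": (17, 26, 16),
}


def emission_probabilities(freqs, phase):
    # Normalize the frequencies of the requested phase so that they sum to 1.
    total = sum(row[phase] for row in freqs.values())
    return {base: row[phase] / total for base, row in freqs.items()}


def draw_base(freqs, phase, rng=None):
    # Draw one base according to the normalized frequencies.
    rng = rng or random.Random()
    probs = emission_probabilities(freqs, phase)
    return rng.choices(list(probs), weights=list(probs.values()))[0]
```

Under these assumptions, the phase 0 frequencies sum to 71, so an `a` would be emitted after `gg` with probability 25/71.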

In

In

Part

Part

**-m [TF]**: Toggles rejection of models with dead-ends *on* (`T`) and *off* (`F`). Notice that disabling the rejection may result in sequences shorter than the length specified by means of the *size* parameter. **Defaults to -m T.**
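To give an intuition of what a dead-end can look like, here is a minimal Python sketch for a homogeneous model. It assumes that a dead-end is a context that can be reached during generation but from which no symbol can be emitted; this definition, the table layout, and the function name are assumptions of the illustration and may differ from the check actually performed by the software.

```python
def find_dead_ends(transitions):
    """Return the reachable contexts from which nothing can be emitted.

    transitions[context][symbol] is a non-negative weight.
    All contexts are assumed to have the same length (the order of the model).
    """
    dead_ends = set()
    for context, row in transitions.items():
        for symbol, weight in row.items():
            if weight <= 0:
                continue
            # Context obtained after emitting `symbol` from `context`.
            following = (context + symbol)[-len(context):] if context else ""
            successors = transitions.get(following, {})
            if not any(w > 0 for w in successors.values()):
                dead_ends.add(following)
    return dead_ends
```

Generation that enters such a context has to stop early, which is consistent with the shorter sequences mentioned above.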

BuildMarkov: Automatic construction of

- *InputFiles* is a set of paths to the sequences that will be used for Markov model construction. **Required (at least one).**

- `-d` *OutputFile*: Outputs the description file to the file `OutputFile`.
- `-p`: Defines the number of phases in the Markov model. It must be a positive integer. A value of 1 means that the model will be homogeneous. **Defaults to 1.**
- `-o`: Defines the order of the Markov model. It must be a non-negative integer. A value of 0 means that the model will be a Bernoulli model. **Defaults to 0.**
- `-v`: Verbose mode, shows the progress of the construction (useful for large files).
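The kind of counting that underlies such a construction can be sketched in Python. This is not the BuildMarkov implementation: the function name and the output layout are hypothetical, and the phase of a symbol is assumed to be its 0-based position in the sequence modulo the number of phases.

```python
from collections import defaultdict


def count_transitions(sequences, order=0, n_phases=1):
    """Count, for each (context, symbol) pair, its occurrences in each phase.

    Returns a dict mapping (context, symbol) to a list of n_phases counts.
    """
    counts = defaultdict(lambda: [0] * n_phases)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]  # the `order` symbols preceding position i
            counts[(context, seq[i])][i % n_phases] += 1
    return dict(counts)
```

For instance, `count_transitions(["atgatg"], order=0, n_phases=3)` counts each base of the repeated triplet in a single phase, whereas `n_phases=1` merges all positions into one homogeneous count.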