GenRGenS v2.0 User Manual
Yann Ponty, Alain Denise
LRI, UMR CNRS 8623
Université Paris-Sud XI
Yann.Ponty@lri.fr Alain.Denise@lri.fr
Random sequences can be
used to extract relevant information from biological sequences. The random
sequences represent the ``background noise'' from which it is possible to differentiate the real
biological information. Random sequences
are widely used to detect over-represented and under-represented motifs,
or to determine whether the scores of pairwise alignments are
relevant. Analytic approaches exist for solving these kinds of problems
(see e.g. [9]), although for the most complex cases, an
experimental approach (i.e. the computer generation of random
sequences) is still necessary.
Some programs are already currently available for generating random
sequences. For example, the GCG package contains a few generation tools,
such as HmmerEmit that generates sequences according to HMM profiles,
and Corrupt that adds random mutations to a given
sequence [3]. Seq-Gen randomly simulates the evolution of
nucleotide sequences along a phylogeny [1]. The Expasy
server has RandSeq, which generates random amino acid
sequences according to a Bernoulli
process [7]. Shufflet is a program that generates random
shuffled sequences [4]. However, until now, there has been no software
package that integrates several statistical and syntactic models of
random sequences and combines them. This is the purpose of
GenRGenS.
The random sequence models currently handled by GenRGenS
are the following:
- MARKOV: Markovian random generation. Puts probabilistic constraints on the occurrences of k-mers among
the generated sequences. A Markovian model can be automatically built from real biological data by a tool
bundled with GenRGenS.
- GRAMMAR: Random generation based on context-free grammars. This syntactic formalism makes it possible to take
into account both sequential and structural constraints. Most long-range interactions and correlations
can be captured by this formalism.
- MASTER: Random generation of hierarchical sequences. Combines different levels of abstraction. Sequences of symbols
are generated using a master description file, and then some of these symbols are rewritten into sequences generated
with respect to auxiliary description files and distributions over sequence lengths.
- RATEXP: Prosite patterns and rational expressions. Generates sequences uniformly at random from a rational expression
or a Prosite pattern. Classical results in language theory show that the expressivity of these formalisms is
included in that of context-free grammars. However,
more efficient generation algorithms are available for this subclass. Moreover, support for Prosite patterns allows
a copy/paste approach to random generation that some may find convenient.
This section is dedicated to the installation and usage of GenRGenS v2.0.
GenRGenS performs random generation from a given sequence model.
Models are passed to the main generation engine through description
files. These are text files meeting certain syntax criteria.
GenRGenS currently supports four classes of models.
Each class of models has its own description file syntax, although they all
share certain properties.
Foreword: GenRGenS' current implementation uses Java, so it is necessary to download
and install a Java runtime environment (or virtual machine) version 1.1.2 or higher, freely downloadable from http://java.sun.com.
The latest versions of GenRGenS (binaries + sources) can be found at the following URL:
http://www.lri.fr/bio/GenRGenS
First download the bundle file found in the download section of the site.
If needed, uncompress the archive into a root directory of your choice, using free tools such as 7zip (.ZIP files), the jar tool
bundled with Java (.JAR files) or GNU tar (.TAR files). We'll assume for the next sections that the chosen directory
is GenRGenSDir.
After decompressing the archive, move to GenRGenSDir and open a shell to invoke the Java virtual machine through the following command:
java -cp . GenRGenS.GenRGenS [options] [-nb k] -size n DescriptionFile
where:
- -nb k is the number of sequences to be generated by GenRGenS. Optional.
- -size n is an indicative length for the generated sequences. Depending on the class of models and the type of generation, it can either be the exact length, an upper bound, or simply ignored. Required.
- DescriptionFile is the path to a description file describing the random sequence model. Required.
Specific options may be available for some classes of models and will be detailed further.
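For instance, assuming a description file named mymodel.ggd (a hypothetical file name) lies in the current directory, the following command generates 10 sequences of length 100:
java -cp . GenRGenS.GenRGenS -nb 10 -size 100 mymodel.ggd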
There are two ways to run the GenRGenS interactive version: one through the command line, the other through the file
associations supported by certain platforms:
- From a shell: Move to GenRGenSDir and, at the prompt, enter one of the following commands:
java -cp . GenRGenS.GenRGenS
or
java -jar GenRGenSvX.X.JAR
- From a graphical environment: Some operating systems maintain associations between files
and eligible executables, based on file name suffixes or file header analysis. Under these platforms,
a simple double-click on the JAR archive will execute the main application.
Figure 2.1: Screenshot of the main GUI
Once the GenRGenS UI is running, a typical random generation scenario is:
- Open a description file, using the 'File/Open Description File...' or 'File/Reopen' menu items, or by clicking on button A.
- Define the required/optional parameters, such as the sequence length or the number of sequences, inside the
Generation Configuration window spawned from the 'Utilities/GenRGenS' menu item or by clicking on button C.
- Validate the generation. Random sequences are generated and displayed in text buffer D. Potential errors or warnings are shown in text buffer E.
- Save the generated sequences to disk, using 'Buffers/Save Output Buffer Content...' or by clicking on button B.
Figure 2.2: Defining the generation parameters
The Generation Configuration window defines a few additional parameters:
- Box 1: The size of the emitted sequences, or an upper bound depending on the type of generation.
- Box 2: The number of generated sequences.
- Checkbox 3: Whether or not to display information about the generation during the process.
- Checkbox 4: Toggles the display of generated sequences on and off. Useful when large numbers
of sequences are required.
- Messagebox 5: Selects the output file.
- Panel 6: Displays a number of generator-specific options. Here, for a Markovian generator, it is possible
to treat the size parameter (Box 1) as an upper bound for the sequence size. The sequence is then
generated letter by letter until either a dead-end is encountered, i.e. no symbol can be emitted after this sequence in this model,
or the given size is reached.
Description files describe random models to the generation engine of GenRGenS. They are composed of clauses, each defining a parameter
in the random model. The syntax of description files will be detailed below, along with the most common clauses.
GenRGenS description files are sequences of clauses. All clauses are based on the pattern Param_Name = Param_Value,
where Param_Name is the name of the parameter being defined and Param_Value its value. Parameters available are
specific to a given random model, although some parameters are shared by most if not all description file types. Clauses
must be given in a model-specific order, otherwise they will be rejected by GenRGenS' main engine. A clause can be optional: if omitted, its
associated parameter will default to a value detailed in the corresponding model's section of this document.
TYPE = {MARKOV,GRAMMAR,RATEXP,MASTER}
The TYPE clause is the first clause of any description file. It defines the type of random model to be used for generation.
Currently supported values for TYPE are listed below:
TYPE     Random model description
MARKOV   Markovian random generation.
GRAMMAR  Random generation based on context-free grammars.
MASTER   Random generation of hierarchical sequences.
RATEXP   Prosite patterns and rational expressions.
Another common clause is the ALIAS clause. It causes GenRGenS to substitute the right-hand
sides of the equalities for the left-hand sides after
the generation is performed. This clause simplifies the writing of large random models while
keeping the output explicit, as one can write the whole model using letters and substitute
more explicit symbols for these letters afterward. See chapter 3 for advanced use of this clause.
Figure 2.3: A simple Markovian description file
For instance, here is a toy example of a description file describing a simple Markovian model:
Clause 1 defines the class of random model to be used for random generation. Here, a Markovian
model is chosen, thus raising the need for the definition of various parameters whose roles are explained below.
Clause 2 defines the order of the Markovian model. A 0 value stands for a Bernoulli model, i.e. the probability of emission
of a letter doesn't depend on previously emitted letters.
Further details and definitions can be found in chapter 3.
Between clauses 2 and 3, the optional clause PHASE = int is omitted. Its default value
will then be used for the number of phases, thus defining a homogeneous Markovian model.
Clause 3 defines the emission probabilities for the various symbols. Instead of asking the user to provide the
probabilities for the different k-mers, we preferred to compute the probabilities from given numbers of occurrences. This approach is well suited to
the use of a Markovian profile built from a real sequence.
Here, we are using the DNA bases A, C, G and T; each base will be emitted
with the probability derived from its given frequency.
Clause 4 performs post-generation rewriting of the sequence. Here, it allows the generation of RNA sequences
from a DNA-dedicated Markovian model.
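As the original figure may not display in this rendering, the following sketch suggests what such a description file could look like; the frequency values are purely illustrative, and the layout of the FREQUENCIES and ALIASES clauses is only a plausible reconstruction:
TYPE = MARKOV
ORDER = 0
FREQUENCIES =
  a 250 c 250 g 250 t 250
ALIASES =
  t = U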
The MARKOV package: Markovian models
Markovian models are the simplest and easiest to use statistical models available for genomic sequences.
The statistical properties associated with a Markovian model make it a valuable tool for anyone who wants
to take into account the occurrences of k-mers in a sequence.
Their most commonly used version, the so-called classical Markovian models, can be automatically built
from a set of real genomic sequences. Hidden Markovian models are also supported by
GenRGenS at the cost of some preprocessing, as shown in section 3.1.3.
Apart from genomics, such models appear in various scientific fields including, but not limited to,
speech recognition, population processes, queuing theory and search engines.
Main definition
Formally, a classical Markovian model applied to a set of random variables X_1, ..., X_n
causes the probabilities associated with the potential values of X_i to depend on the values of the previous
variables X_1, ..., X_{i-1}. We will focus on homogeneous Markovian
models, where the probabilities for the different values of X_i
are conditioned only by the values already chosen for a small subset
X_{i-k}, ..., X_{i-1} of variables from the past, also called the context of X_i. The parameter k is called the order
of the Markovian model. Moreover, the probabilities of the values for X_i
in a homogeneous Markovian model cannot in any way be
conditioned by the index i of the variable.
Figure 3.1: The probability of the event "the i-th base is an A" depends on its k predecessors.
Applied to genomic sequences, the random variable X_i stands for the i-th base in the sequence. The Markovian model
constrains the occurrence probability for a base in a given context composed of the k previously assigned
letters, therefore weakly constraining the proportions of each k-mer.
What about heterogeneity?
GenRGenS handles a small subset of the heterogeneous class of Markovian
models.
Formally, a heterogeneous model allows the probabilities
associated with a variable X_i to depend on any of the variables prior to X_i. Our subset of the heterogeneous Markovian
models uses an integer parameter p, called the number of phases, to compute the probabilities for X_i. The phase of a variable X_i
is simply given by the formula i mod p. In our subset of the heterogeneous Markovian models, the
probabilities for X_i depend on both the context and the phase of the variable. Such an addition is useful to model
phenomena in which variables are gathered in packs of p consecutive variables, whose values not only
depend on their contexts, but also on their relative position inside the pack. Such are sequences of protein-coding DNA, where
the position of a base inside a base triplet is well known to be of interest.
It is also possible to simulate such a phase alternation by introducing dummy letters encoding the phase
in which they are produced. For instance, an order 0 model having 3 phases (for coding DNA...) would
result in a sequence model over the alphabet
{a0, a1, a2, c0, c1, c2, g0, g1, g2, t0, t1, t2},
having non-null transition probabilities only for transitions from a phase-0 letter to a phase-1 letter,
from a phase-1 letter to a phase-2 letter, and from a phase-2 letter back to a phase-0 letter.
Therefore, the expressivity of our heterogeneous models does not
exceed that of the homogeneous ones, but it is much more convenient
to write these models using a heterogeneous syntax.
Hidden Markovian Models (HMMs)
Hidden Markovian models address the hierarchical decomposability of most sequences.
A hidden Markovian model is a combination of a top-level Markovian model and a set of bottom-level Markovian models,
called hidden states. The generation process associated with an HMM initiates the sequence using a random hidden state.
At each step of the generation, the algorithm may switch to another hidden state using probabilities from the top-level model,
and then emits a symbol using the probabilities associated with the current hidden state.
Once again, this class of models' expressivity seems to exceed that of the classical Markovian models.
However, in our context, it is possible to emulate a hidden model with a
classical one just by duplicating the alphabet so that each emitted character also encodes the state
it belongs to.
This section describes the syntax and semantics of Markovian description files, as shown in figure 3.2.
Figure 3.2: Main structure of a Markovian description file
Clauses nested inside square brackets are optional. The given order for the clauses is mandatory.
Markovian description files allow the definition of the Markovian model parameters.
ORDER = k
Required
Sets the order of the underlying Markovian model to a non-negative integer value k.
The order of a Markovian model is the number of previously emitted symbols taken into account
for the emission probabilities of the next symbol.
See section 3.1.1 for more details about this parameter.
PHASE = p
Optional, defaults to PHASE = 0 (homogeneous model)
Sets the number of phases of GenRGenS' heterogeneous Markovian model subclass to a positive non-null integer value p.
See section 3.1.2 for more details about this parameter.
SYMBOLS = {WORDS,LETTERS}
Optional, defaults to SYMBOLS = WORDS
Chooses the type of symbols to be used for random generation.
When WORDS is selected, each pair of symbols must be separated by at least one blank character (space, tabulation or newline). A Markovian
description file written using WORDS will then be easier to read, as explicit names for symbols can be used, but may take a little longer
to write, as illustrated by the toy example of the FREQUENCIES clause's content for a simple
Markovian description file modelling the Intergenic/ORF alternation at a base-triplet level:
Figure 3.3: A simple Markovian model for the Intergenic/ORF alternation
A LETTERS value for the SYMBOLS parameter will force GenRGenS to see a word as a sequence of symbols. For instance,
the definition of the occurrence proportions of the symbols A/C/G/T in a context ACC
will be expressed by the following kind of statement inside the FREQUENCIES clause:
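The original statement is not reproduced here; a plausible equivalent, with purely illustrative counts, would be:
ACCA 10 ACCC 20 ACCG 30 ACCT 40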
START = prefix1 f1 prefix2 f2 ...
Optional, defaults to the distribution of k-mers, k being the value of ORDER (see below).
Defines the frequencies associated with the various eligible prefixes for the generated sequence.
Each prefix is either a sequence of symbols separated by white spaces or a word, depending on the value
of the SYMBOLS parameter. Every prefix must be composed of the same number of symbols, namely k.
If omitted, the beginning of the sequence is chosen according to the distribution of k-mers implied by
the content of the FREQUENCIES clause, computed from the probabilities p(x | c) of emitting
a symbol x of the alphabet in a context c of size k, as defined by the FREQUENCIES clause.
Note that the entries of a START clause need not be given in any specific order.
FREQUENCIES = w1 f1 w2 f2 ...
Required
This clause defines the probabilities of emission of the Markovian model.
Each wi is either a sequence of symbols separated by white spaces or a word, depending on the value
of the SYMBOLS parameter. Each wi is composed of k+1 symbols, k being the order. The first k
symbols define the context and the last letter a candidate for emission. The relationship between the frequencies F
and the probability p(x | c) of emitting a symbol x in a context c
is given by the following formula:
p(x | c) = F(c.x) / Σy F(c.y)
The (simple) idea behind this formula is that the probability of emission of a base in a certain context equals the frequency of the
context/base concatenation divided by the sum of the frequencies sharing the same context.
The choice has been made to specify frequencies instead of direct probabilities because most of the Markovian models encountered in genomics are
built from real data by counting the occurrences of all k-mers, thus allowing the direct injection of the counting results
into the description file. As for the START clause, it is not necessary to provide the values of the FREQUENCIES clause in a specific
order.
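As an illustration (with made-up counts), the following order 1 FREQUENCIES entries would yield an emission probability of c after a equal to 30 / (10 + 30 + 40 + 20) = 0.3:
aa 10 ac 30 ag 40 at 20 (plus similar entries for the contexts c, g and t)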
HMMFREQUENCIES =
  <master model definition> ;
  <hidden state 1> : <model definition for hidden state 1> ;
  <hidden state 2> : <model definition for hidden state 2> ;
  ...
Required
Defines the hidden Markovian model's probabilities of emission.
First, a master model is defined for the alternation of the hidden states. Its entries are
sequences of hidden states, separated or not by whitespaces depending on the
previously defined value of the SYMBOLS clause, each followed by a sequence of integers
representing the frequencies of the corresponding sequence of states. The length of each sequence of integers equals the phase parameter of the
model, as provided by the PHASE clause.
Then, a model definition is required for each hidden state,
introduced by the name of the state followed by a colon.
Its entries are composed of symbols taken from the emission vocabulary, followed by their frequencies.
Once such a model is defined, a sequence is issued starting from a random hidden state. At each step
of the generation, the process is allowed to move from the current hidden state
to another hidden state (potentially the same one), and then emits a symbol according to the probabilities of the new current hidden state.
The process is iterated until the expected number of symbols has been generated.
Remark 1: Each model definition must be ended by a semicolon ; to avoid ambiguity.
Remark 2: The vocabularies used for the master model (hidden state names) and for the hidden state models (emitted symbols) must be disjoint.
Remark 3: The numbers of phases of the heterogeneous master and hidden state models must be equal.
Remark 4: Different orders for the master and state models are supported.
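The exact layout of this clause is best taken from the HMM example around figure 3.8; as a rough, unverified sketch based on the description above, an order 0, single-phase HMM with two hidden states J and L emitting the letters a and c could look like:
HMMFREQUENCIES =
  J 5 L 2 ;
  J : a 3 c 1 ;
  L : a 1 c 4 ;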
We illustrate the design of a Markovian model with a description file generated automatically from a portion of a human chromosome, using the tool
BuildMarkov bundled with GenRGenS and described in section 3.4.2, through the following command:
java -cp . GenRGenS.markov.BuildMarkov -o 0 -p 1 q22.fasta -d q22.ggd
This model is a Bernoulli one, as no memory is involved here, i.e. the probability of emission of a base doesn't depend on events from the past.
Figure 3.4: A Bernoulli model for an entire human chromosome (cleaned up of unknown N bases).
In 1, we choose a Markovian random generation.
In 2, we choose an amnesic model, i.e. there is no relation between the probabilities of two bases.
In 3, we choose a homogeneous model: the probabilities of emission won't depend on the index of the symbol
within the sequence.
In 4, we use letters as symbols for simplicity's sake.
Clause 5 provides the definition of the frequencies. Here, A and T are slightly more likely
to occur than C and G; precisely, at each step an A symbol will be emitted with the probability derived from its frequency.
This description file has been built automatically from a portion of the q11 region of the 17th human chromosome by
the tool BuildMarkov, invoked through the following command:
java -cp . GenRGenS.markov.BuildMarkov -o 1 -p 1 17-q11.fasta -d 17-q11.ggd
Figure 3.5: Order 1 Markov model for the sequence of the mRNA encoding the S-protein involved in serum spreading.
In 1, we choose a Markovian random generation.
In 2, we choose an order 1 model, i.e. the probability of emission of a symbol only depends on the symbol
immediately preceding it in the sequence.
In 3, we choose a homogeneous model: the probabilities of emission won't depend on the index of the symbol
within the sequence.
In 4, we illustrate the use of WORDS. The symbols must then be separated in the FREQUENCIES clause by
at least one whitespace or carriage return. Space characters will be inserted between symbols in the output.
Clause 5 provides the definition of the frequencies, and consequently of the transition/emission probabilities.
The following description file describes a Markovian model for the q31.1 region of the first human chromosome,
built by BuildMarkov through:
java -cp . GenRGenS.markov.BuildMarkov -o 2 -p 3 1q31.1.fasta -d 1q31.1.ggd
and edited afterward to include the START clause.
Figure 3.6: A Markovian description file for the q31.1 region of the first human chromosome
In 1, we choose a Markovian random generation.
In 2, we choose an order 2 model: we will consider the frequencies of base triplets, i.e. the probability of emission of a base is conditioned by the
two previous bases.
In 3, we differentiate the behaviour of the Markovian process for each of the three phases, in order
to capture some properties of coding DNA.
In 4, we use letters as symbols for simplicity's sake.
Clause 5 initiates each sequence
with either an atg or a gtg base triplet, with probabilities proportional to the given frequencies.
Clause 6 provides the definition of the frequencies. To explain the semantics of this clause, we will
focus on some particular frequency definitions:
agg 19 18 22 aga 47 30 40 agt 43 28 43 agc 30 24 22
Writing agg 19 means that base g has occurred 19
times in an ag context on phase 0.
In other words, the motif agg has occurred 19
times on phase 0. We will illustrate the link with the emission
probability of g in a context ag in the next section.
- First, a random start atg is chosen according to the START clause.
- The context, here composed of the two most recently emitted bases, is now tg, and the phase of the next base is
determined by its index. The emission probabilities for the bases a, c, g and t are obtained from the tg* frequencies of the
current phase, using the formula above.
- After a call to a random number generator, a g base is chosen and emitted.
- The new context is then gg, and the new phase is 0. We then consider new
probabilities for the bases a, c, g and t:
gga 25 22 17 ggc 15 8 21 ggg 14 12 15 ggt 17 26 16
The probabilities of emission for the different bases are then 25/71 for a, 15/71 for c, 14/71 for g and 17/71 for t.
Figure 3.7: Basic HMM excerpted from the sequence analyst's bible [6]
Consider the basic output of an HMM profile building algorithm drawn in figure 3.7.
It can be translated into the following GenRGenS input file:
Figure 3.8: A Markovian description file for the HMM of figure 3.7
In 1, we choose a Markovian random generation.
In 2, an order is defined for the master model.
In 3, we choose to manipulate words as symbols.
Part 4 is the definition of the frequencies of the master Markovian model. For instance, the probability of choosing
L as the next hidden state while in state J is proportional to the corresponding frequency.
Remark: The word START is a keyword and cannot be used as an identifier.
Part 5 contains the different emission frequencies, converted from probabilities to integers.
Inside some Markovian models, it is theoretically possible that some states are dead-ends, the frequencies
of all candidate symbols summing to 0 in this context. It is then impossible to emit an extra symbol. That is why
GenRGenS always checks whether each reachable state can be exited. The default behaviour of GenRGenS
is to refuse to generate sequences in such a case, as the generated sequences are supposed to be of the size provided
by the user, and it is senseless to keep on generating letters once a dead-end is encountered. However, this
phenomenon may be intentional, for instance in order to end the generation with a specific word (perhaps a STOP
codon...) when total control over the size can be neglected.
Markovian-specific command line option usage:
java -cp . GenRGenS.GenRGenS -size n -nb k -m [T|F] MarkovianGGDFile
- -m [T|F]: Toggles rejection of models with dead-ends on (T) and off (F).
Notice that disabling the rejection may result in sequences shorter than specified by means of the size parameter.
Defaults to -m T.
BuildMarkov: Automatic construction of MARKOV description files
This small tool is dedicated to the automatic construction of MARKOV description files
from a set of sequences. It scans the sequence(s) provided through the command line and evaluates the
probabilities of transition from and to each Markov state.
After decompression of the GenRGenS archive, move to GenRGenSDir
and open a shell to invoke the Java virtual machine through the following command:
java -cp . GenRGenS.markov.BuildMarkov [options] InputFiles
where:
- InputFiles is a set of paths pointing to the sequence files that will be used for Markov model construction. Required (at least one).
The following parameters/options can also be specified:
- -d OutputFile: Outputs the description file to the file OutputFile.
- -p p: Defines the number of phases of the Markov model. p must be at least 1; p = 1 means that
the model will be homogeneous. Defaults to p = 1.
- -o o: Defines the order of the Markov model. o must be at least 0; o = 0 means that
the model will be a Bernoulli model. Defaults to o = 0.
- -v: Verbose mode, shows the progress of the construction (useful for large files).
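For instance, to build an order 2, 3-phase model from a FASTA file named sequences.fasta and then generate 10 sequences of length 500 from it (both file names hypothetical), one could run:
java -cp . GenRGenS.markov.BuildMarkov -o 2 -p 3 sequences.fasta -d model.ggd
java -cp . GenRGenS.GenRGenS -nb 10 -size 500 model.ggd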
The input files can be flat files, i.e. text files containing a raw sequence, or FASTA
formatted files, i.e. an alternation of title lines and sequences, which usually look like the figure below.
Figure 3.9: A typical FASTA formatted file, containing several sequences.
Remark 1: Inside a sequence definition, both for FASTA and flat files,
BuildMarkov ignores spaces, tabulations and carriage returns,
so that the sequence can be indented in a user-friendly way.
Remark 2: When several sequences are found inside a FASTA file,
they are considered as different sequences and all of them are processed to build the Markov model.
If the FASTA file contains a single sequence split into contigs, which therefore needs
to be concatenated, then a preprocessing of the file is required to remove the title
lines before it can be passed as an argument to BuildMarkov.
For many years, context-free grammars have been used in the field of formal language theory.
They seem to be the best compromise between computability and expressivity. Applied to genomics,
they can model long-range interactions, such as base pairings inside an RNA, as shown by D.B. Searls [12]. We provide in this
package generation algorithms for two classes of models based on context-free grammars: the uniform models and
the weighted models. In the former, each sequence of a given size has the same probability of being drawn. In the
latter, each sequence's probability is proportional to its composition.
As with stochastic grammars, a set of weights can be chosen to achieve specific symbol frequencies. Unlike with stochastic grammars,
one can precisely control the size of the output sequences.
The term context-free refers to a restriction on the form of the rules. This restriction is
based on language-theoretic properties.
A sample context-free grammar, and the way a sequence is produced from the axiom, can be seen in figures 4.1
and 4.2.
Figure 4.1: A context-free grammar for the well-balanced words
Figure 4.2: Derivation of a word from the axiom. At each step, the freshly produced letters and the letters
that will be rewritten at the next step are highlighted.
In the computer science and combinatorics fields, words issued from grammars are used as codes
for certain objects, like computer programs, combinatorial objects, ... For instance, the grammar
whose rules are given in figure 4.1 naturally encodes well-balanced parenthesis words, if a
stands for an opening parenthesis and b stands for a closing one. A one-to-one correspondence has also
been shown between the words generated by this grammar and some classes of trees. As trees are also
used as approximations of biological structures, it is possible to consider the use of context-free grammars as discrete
models for genomic structures.
We now discuss the utility of context-free grammars as discrete models for genomic structures. Most commonly used
models only take into account a small subset of a letter's neighborhood. For instance, in a Markov model, the probability
of emission of a letter depends only on a fixed number of previously emitted letters. Although these models can prove
sufficient when structural data can be neglected, they fail to capture the long-range interactions in structured
macromolecules such as RNAs or proteins. Back to our CFGs, figure 4.2 illustrates the fact that letters
produced at the same step can, at the end of the generation, be separated by further derivations. Since dependencies
inside a CFG-based model arise from the proximity of terminal symbols inside the rules, CFGs can capture both
sequential and structural properties, as illustrated by figure 4.3.
Figure 4.3: Long-range interaction between bases 1 and 2 can be captured by CFGs, as they are generated at the same stage
and then separated by further non-terminal expansions. The CFG used here is an enrichment of the one shown in figure 4.2.
The uniform random generation algorithm, adapted from a more general one [8], is briefly described here.
At first, we motivate the need for a precomputation stage, by pointing
out the fact that a stochastic approach is not sufficient to achieve uniform generation. Moreover, controlling the size
of the output using stochastic grammars is already a challenge in itself.
To illustrate this point, we introduce the historical grammar for RNA structures, whose rules are described
in figure 4.4.
Figure 4.4: A CFG grammar for the RNA secondary structures
Now, suppose one wants to generate a word of a given size n uniformly at random from this grammar.
A natural idea would be to rewrite from the axiom until a word of size n is obtained.
One then has to check whether each derivation can yield a word of the desired length: for
instance, the rule producing only the empty word need not
be investigated.
Once a set of derivations suitable for a given length is built, one has to carefully
associate probabilities with each derivation. Indeed, choosing any of the eligible rules with
equal probabilities may end in a biased generation, as the numbers of sequences accessible from
the different rules are not equal. This fact is illustrated by the example from figure 4.4,
where two rules, denoted (1) and (2), are eligible for sequences of a given size starting from the axiom.
Figure 4.5: Deriving words of a given size from the axiom
As illustrated by figure 4.5,
choosing rules (1) or (2) with equal probabilities is equivalent to granting the sets of words
produced by rules (1) and (2) a total probability of 1/2 each. As these sets do not
have equal cardinalities, uniform random generation cannot be achieved this way.
That is why the grammar is first analyzed during a preprocessing stage, which precomputes the
probabilities of choosing the rules at any stage of the generation. These probabilities are proportional
to the number of sequences accessible after the choice of each applicable rule for a given non-terminal,
normalized by the total number of sequences for this non-terminal.
Once these computations are made, it is possible to perform uniform random generation by choosing each
rewriting using the precomputed probabilities, as shown in figure 4.6.
Figure 4.6: The complete tree of probabilities for the words of a given size.
Weighted languages for controlled non-uniform random generation
Although uniform models can prove useful in the analysis of algorithms, or to shed light on some over-represented phenomena,
they are not always sufficient to model biological sequences or structures. For instance, it can be shown using combinatorial tools that
the classical model for RNA secondary structures from figure 4.4 produces, on average,
helices that are only about one base pair long (see [10] or [11]), which is not realistic.
Furthermore, substituting the helix rule with
a rule enforcing a minimal number of base pairs inside a helix only raises
the average length of a helix to exactly that minimum, losing all variability in the process.
Thus, we propose in [5] a way to control the values of the parameters of interest by putting weights
on the terminal letters. We then define the weight of a sequence to be the product of the individual weights
of its letters. For instance, in a weighted CFG model where the letter a has weight 2 and all other letters have weight 1,
the sequence aab has a weight of 2 x 2 x 1 = 4.
By making a few adjustments during the precomputation stage, it is possible to constrain
the probability of emission of a sequence to be equal to its weight, normalized by the sum of the weights of all sequences of the same size.
Under certain hypotheses, it can be proved that for any expected frequencies of the terminal
letters, there exist weights such that the desired frequencies are observed. By assigning heavy weights
to the nucleotides inside helices, and low ones to helix starts, it is now possible to control the
average size of helices, terminal loops, bulges and multiloops.
To illustrate the effect of such a weighting, we consider the previous example from figure 4.5, and we point out that the
average number of unpaired bases takes a certain value in the uniform model. By assigning a weight greater than 1
to the unpaired base letter c, we obtain the probabilities of emission for sequences of the considered size
summarized in table 4.7.
Figure 4.7: Probabilities of all words of a given size under a weighted model
These probabilities yield a new expected value for the number of unpaired bases in a random structure.
So, by assigning a suitable weight, barely higher than 1, it is possible to constrain the average number of occurrences of the unpaired base symbol
c to a chosen value. This property holds only for sequences of the considered size, but another weight can be computed for any
sequence size. Moreover, we claim that, under certain common-sense hypotheses, the frequency of a given symbol quickly reaches an
asymptotic behaviour, so that the weight does not need to be adapted to the sequence size above a certain point.
This section describes the syntax and semantics of context-free grammar based description files.
Figure 4.8: Main structure of a context-free grammar description file
Clauses nested inside square brackets are optional. The given order for the clauses is mandatory.
Context-free grammar based description files define the attributes and properties of the grammar.
SYMBOLS = {WORDS,LETTERS}
Optional, defaults to SYMBOLS = WORDS
Chooses the type of symbols to be used for random generation.
When WORDS is selected, each pair of symbols must be separated by at least one blank character (space, tabulation or newline).
This mostly affects the RULES clause, where the right-hand sides of the rules will be made of words separated by blank
characters. This feature can greatly improve the readability of the description file, as shown in the example from figure 4.9.
Figure 4.9: The skeleton of a grammar for the tRNA
A LETTERS value for the SYMBOLS parameter will force GenRGenS to see a word as a sequence of symbols. For instance,
the right-hand side of the rule S->aSbS will decompose into S -> a S b S if this option is selected. Moreover,
blank characters will be inserted between symbols in the output.
START = S
Optional, defaults to the left-hand side of the first
rule defined by the RULES clause.
Defines the axiom S of the grammar, that is the non-terminal symbol initiating the generation. This symbol
must of course appear as the left-hand side of some rule inside the RULES clause below.
RULES = <list of rules of the form LHS -> RHS>
Required
Defines the rules of the grammar.
Rules are composed of letters, also called symbols. Each symbol
that appears as the left-hand side of a rule becomes
a non-terminal symbol, that is a symbol that needs further rewriting.
Right-hand sides are sequences of symbols, which may or may not appear
as the left-hand side of other rules; they must be separated by spaces or tabs when SYMBOLS = WORDS is selected.
If the START clause is omitted, the axiom of the grammar will be the left-hand side of the first rule.
The characters ::= can be used instead of -> to separate the left-hand side
symbols from the right-hand side ones.
WEIGHTS = t1 w1 t2 w2 ...
Optional, weights default to 1.
Defines the weights of the terminal letters.
As discussed in section 4.3, assigning a weight to a terminal letter is a way
to gain control over the average frequency of this letter. For instance, in a grammar over the alphabet
{a, b}, adding a clause WEIGHTS = a 2 will grant the sequences aaab and bbbb the
weights 8 and 1 respectively, i.e. the generation probability of aaab will be 8 times higher than that of
bbbb.
The following description file captures some structural properties of the tRNA.
Figure 4.10: A basic CFG-based model for the tRNA
In 1, we choose a context-free grammar based model.
In 2, we use words, a reader-friendly choice for such a complicated (though still somewhat simplistic) model.
In 3, we choose the non-terminal symbol tRNA as our starting symbol. Every generated sequence will
be obtained from iterative rewritings of this symbol. Here, this clause has no real effect, as the default behaviour of
GenRGenS is to choose the first non-terminal appearing in a rule as the starting symbol.
Clause 4 describes the set of rules. The correspondence between non-terminal symbols and substructures
is detailed in figure 4.11.
Figure 4.11: Non-terminal symbols and substructures
Note that we use different alphabets to discern paired bases from unpaired ones, which allows discrimination
of unpaired bases using weights in clause 5.
Of course, it is possible to get a more realistic model by substituting specific sets of rules, using different alphabets,
for the X_Loop -> Fill rules. Thus, one would be able to control the size and composition of each
loop.
In 5, we penalize the unpaired bases, which increases the size of the helices. Indeed, uniformity tends to favor
very small if not empty helices, whereas the weighting scheme from this clause penalizes the occurrences of unpaired
bases so much that the generated structures now have very small loops.
In 6, the sequence's alphabet is unified by removing traces of the structure.
During the precomputation stage, GenRGenS uses arbitrary precision arithmetic to handle numbers that can grow
exponentially with the sequence size. However, for special languages or small sequences, it is possible to avoid using
time-consuming arbitrary precision arithmetic and to limit the computation to double variables. This is the purpose of the
-f option. The -s option toggles on and off the computation and display of the
frequencies of symbols for each generated sequence.
Grammar-specific command line options usage:
java -cp . GenRGenS.GenRGenS -size n -nb k -f [T|F] -s [T|F] GrammarGGDFile
- -f [T|F]: Activates/deactivates the use of light but limited double variables instead of
heavier but more accurate BigInteger ones. Misuse of this option may result in divisions by zero or non-uniform generation.
Defaults to -f F.
- -s [T|F]: Enables/disables statistics about each sequence.
Defaults to -s F.
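For instance, to generate 10 sequences of length 80 while printing per-sequence symbol statistics from a grammar description file (the file name tRNA.ggd is hypothetical):
java -cp . GenRGenS.GenRGenS -size 80 -nb 10 -s T tRNA.ggd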
At the bottom of Chomsky's hierarchy of formal languages lie the regular expressions. Taken as language theory tools, their main
purpose is the description of subsets of the set of all sequences that can be parsed and/or generated by very simple machines called
finite state automata over finite alphabets. This class is known to be a subclass of the one implemented by
context-free grammars, but more efficient random generation algorithms are available for it.
Because of their simplicity, they have been extensively used to model families of genomic sequences
differing only on a few positions. In the Prosite database, which is a well-known database of protein families and domains, a pattern
(or consensus) is built from a given set of sequences sharing functional properties. These so-called Prosite patterns are related
to regular expressions, as shown in section 5.1.3.
GenRGenS provides a random generation process both for Prosite patterns and regular expressions.
For instance, random sequences drawn with respect to a Prosite pattern supposedly being a fingerprint for a biological
property can be used to test its relevance. One can also generate simple mutants from a regular expression by introducing
some choice points inside the sequence. Such sequences can be used for the computation of statistical scores, such as Z-scores and P-values.
In this chapter, we will describe the syntax and semantics of the regular expressions. We will then explain
how GenRGenS turns a Prosite pattern into a regular expression associated with the same language, and show how
uniform random generation can be performed from a regular expression.
Regular expressions syntax
A regular expression e is a language description tool, recursively defined as follows:
e ::= e1 | e2        (disjunction)
e ::= e1 e2          (concatenation)
e ::= e1*            (iteration)
e ::= (e1)           (parentheses)
e ::= x  or  ε       (single letter or empty word)
where e1 and e2 are regular expressions, and x stands for any letter of the alphabet.
The symbol ε is a shortcut for the empty word (a word of size 0).
Formally, a language can be seen as a set of words over a given alphabet.
The meanings of these alternatives are related to the languages denoted by such expressions.
- The disjunction e1 | e2: The language is the union of the languages associated with e1 and e2. Any word of the
resulting language belongs to the language of e1 or to that of e2.
- The concatenation e1 e2: The language is the concatenation of the languages associated with e1 and e2. Any word of the
resulting language can be decomposed into the concatenation of a word issued from e1 and a word issued from e2.
- The iteration e1*: Each word of the resulting language is a concatenation of a finite sequence of words issued from e1.
- The parentheses (e1): The language is that of e1. This construction is useful to avoid ambiguity. For instance, the expression
a|b* can denote the language of a|(b*) or the language of (a|b)*.
- The letter x, resp. ε: The only word of the language is the single character x, resp. the empty word.
Example 1:
The regular expression ATG (A|C|G|T)* (TAG|TGA|TAA) describes a simplistic model for an ORF, that is a subsequence of DNA starting
with the START base triplet ATG
and ending with one of the STOP base triplets TAG/TGA/TAA.
It should be noticed that anything can happen between the START and STOP codons, as the
(A|C|G|T)* part of the expression does not exclude STOP base triplets.
Example 2:
Such expressions are also perfectly fit to model mutants. Suppose you are given three 5.8S ribosomal RNA sequences, close to one another and
aligned as follows:
... C G C C C C G C C G G C G G ...
... A C G C G A C C C G G U G G ...
... C C U G U U . G U G G U G G ...
A regular expression combining, at each position, a disjunction of the bases observed in the three sequences (the gapped position being made optional)
can then be used to model a family of sequences that contains the three original sequences, along with some
other sequences close to the originals.
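For illustration only (the expression shown in the original manual is not reproduced here), such an expression could be built column by column from the alignment above, the empty word being written here as ε:
(C|A) (G|C) (C|G|U) (C|G) (C|G|U) (C|A|U) (G|C|ε) (C|G) (C|U) G G (C|U) G G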
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and
profiles that help identify to which known protein family a new sequence belongs. It was started in 1989 by Amos
Bairoch, and is part of the SWISS-PROT program [2]. It features text files structured by tags and carriage returns.
Among these tags, some are used to define patterns that model the sequential aspects behind functional properties.
They can be seen as consensus, as they capture some sequential
similarities of a given entry set of sequences.
Further information can be found at:
http://www.expasy.org/prosite/prosuser.html
Let A be the set of all known amino acids, abbreviated with respect to the standard
one-letter IUPAC code given by figure 5.1.
Figure 5.1: Standard codes for the amino acids, as proposed by the IUPAC
A PROSITE pattern P can then be recursively defined as follows:
P ::= P1-P2      (concatenation)
P ::= P1(n)      (strict iteration)
P ::= P1(i,j)    (loose iteration)
P ::= [s]        (inclusive disjunction)
P ::= {s}        (exclusive disjunction)
P ::= x          (wildcard)
P ::= s          (amino-acid sequence)
where P1 and P2 are PROSITE patterns; n, i and j are positive integers; and s is a sequence of amino acids encoded
with respect to the IUPAC code for the amino acids. The semantics of the preceding alternatives is described below:
- The concatenation P1-P2: protein codes are derived from patterns P1 and P2 and concatenated.
- The strict iteration P1(n): a protein code is derived from pattern P1 and copied n times.
- The loose iteration P1(i,j): a protein code is derived from pattern P1 and copied between i and j times.
- The identity: this notation is equivalent to the pattern it encloses, and simply means that a sequence is issued
from it. It can be used to resolve some ambiguities.
- The inclusive disjunction [s]: any amino-acid code can be chosen from the list s.
- The exclusive disjunction {s}: any amino-acid code that is not in the list s can be chosen.
- The wildcard x: any amino acid, including the unknown amino-acid code.
- The amino-acid sequence s: s is a word composed of amino-acid codes.
As rational expressions are also defined using concatenations, iterations and disjunctions, a PROSITE pattern can be considered
as a rational expression. The correspondence between PROSITE elementary operations and rational ones is listed below:
It is a classical result of formal language theory that any regular expression can be
turned into an automaton that recognises/generates the same language. Such an automaton can be turned into
a deterministic one, that is an automaton where each accepted word is uniquely coupled with a walk between two
special states (start and final). Thus, instead of expanding an initial non-terminal symbol the way we did for
the GRAMMAR package, we are going to walk inside this automaton with carefully chosen probabilities.
For instance, consider the ORFs inside DNA. They start with a START codon ATG, then follows a region of
codons that are not STOP codons, and finally they end with the STOP codon TAA.
This can be modelled by an expression whose deterministic automaton is shown in figure 5.2.
Figure 5.2: Deterministic automaton corresponding to the simple ORF model.
Each sequence is bijectively associated with a single path in the resulting automaton. For instance, the sequence ATG AAT TCG GAT TAA
corresponds to a unique sequence of states. As explained for the GRAMMAR package, the choice of the
next state must be made using correct probabilities in order to achieve
uniformity. These probabilities are proportional to the number of words
reachable from the destination state. Once again, these numbers can be
computed using simple linear recurrences before the generation stage.
Controlled non-uniform random generation can also be achieved using the
same distributions as within the GRAMMAR package. Indeed, each
symbol (letter, IUPAC code) can be associated with a weight inside the
WEIGHTS clause, so that a sequence's probability will be the product
of all its letters' weights, normalized by the sum of the weights of all the sequences
of the set denoted by the expression. This weight mechanism may be used, for
instance, to increase the proportion of a given base as well as to favor the
occurrence of a given structured motif.
The main structure of the file is shown in figure 5.3.
Figure 5.3: Main structure of a RATIONAL description file
LANGUAGE = RATIONAL | PROSITE
Required
Chooses between rational/regular expression and PROSITE pattern syntaxes for the expression.
EXPRESSION = e
Required
The rational or PROSITE expression. The syntax for e depends on the value of the LANGUAGE clause,
and is described in section 5.1.1 for a RATIONAL choice, and in section
5.1.2.2 for PROSITE patterns.
WEIGHTS = t1 w1 t2 w2 ...
Optional, all weights default to 1.
Defines the weights of the terminal letters.
As discussed in section 4.3 for the GRAMMAR description files, assigning a weight
to a terminal letter is a way to gain control over its average frequency.
We show and explain the GenRGenS description file corresponding to the simple ORF model shown in
figure 5.2.
On line 1, we choose the RATIONAL package for random generation.
On line 2, we select a rational/regular expression.
On line 3, we give the rational expression corresponding to the simple ORF model from figure
5.2. Notice that the star operator * has highest precedence,
so that "A|B*" is interpreted as "A|(B*)" and not as "(A|B)*".
On lines 4 and 5, weights are assigned to the C's and G's. A weight higher than 1
for a symbol will increase its number of occurrences within generated sequences. Here, we choose to favor C and G
inside the coding area.
On lines 6 and 7, the symbols C' and G' were only introduced to allow the assignment of
weights within the coding area. Indeed, we needed the weight not to affect the G from the START codon.
We don't really need them anymore, so we send them back to their original representations C and G.
We illustrate the use of GenRGenS' RATIONAL package to generate protein sequences with respect
to a given PROSITE pattern with a real-life example. The pattern CBM1_1 can be found under
accession code PS00562 on the ExPASy server and is considered a signature for the carbohydrate
binding type-1 domain. Figure 5.4 shows its entry in the PROSITE database.
Figure 5.4: PROSITE entry for pattern CBM1_1
The PA line of the PROSITE file can be inserted directly into the
EXPRESSION clause, without the terminal dot, resulting
in the following description file:
On line 1, we choose the RATIONAL package for random generation.
On line 2, we select a PROSITE pattern expression.
On lines 3 and 4, the PROSITE pattern is given.
The default behaviour of GenRGenS is to generate sequences of the size provided through the -size command line option.
However, it can also be useful to generate among all the words that can possibly be drawn from a PROSITE expression, regardless of their size.
This is the purpose of the -i option. The sequences are still drawn at random.
Rational-specific command line options usage:
java -cp . GenRGenS.GenRGenS -size n -nb k -i [T|F] PrositeGGDFile
- -i [T|F]: When T is selected, ignores the size parameter, so that the sequences
are drawn at random among the finite set corresponding to the PROSITE pattern defined in the file
PrositeGGDFile. Generates sequences of the given size otherwise.
Defaults to -i F.
Sometimes, true uniformity or total control over the distribution
of the sequences is not needed, while the hierarchical aspects of
some models still have to be captured. The masterfile generation is then a good
tradeoff between computational efficiency and expressivity.
The main principle is to generalize the idea behind hidden Markovian models
to any type of master and hidden-state models.
At first, a sequence is derived from a master description file, with
a length passed to the main generation engine of GenRGenS. Then each
of its symbols may be declared non-terminal and
rewritten using another description file, coupled with a distribution over sizes.
It is noticeable that this approach to random generation can save some time and
space when using models that require more than linear time and space
complexities. Indeed, if n1 and n2 are the sizes of the two parts
of an expected sequence, then f(n1) + f(n2) <= f(n1 + n2) for
any superlinear complexity function f.
It also implies a small loss of control over the distribution of
sequences. For instance, a model built up by concatenating two
sequences chosen uniformly is unlikely to be uniform itself.
More generally, let M1 and M2 be two models; a sequence w issued
from the concatenation of M1 and M2 may admit different decompositions
w = u.v with u issued from M1 and v from M2.
Its global probability is then the sum, over all such decompositions, of the products of the probabilities of u in M1 and of v in M2,
which induces a bias if uniformity is aimed at.
Furthermore, as it is impossible to constrain the ending of a Markov
chain, the concatenation of two sequences issued from Markovian
models may induce some bias at the cut point.
The MASTER package introduces a way to capture dependencies between the sizes of two subsequences.
A typical dependency that one may find useful would be: the total size of the two subsequences
is 100 nt, although the size of the former has been observed to vary with respect to a given distribution.
The MASTER package offers the possibility to declare variables, which is a powerful but
delicate feature.
Inside a MASTER description file, simply enclose any expression of a length definition inside
a VAR declaration. Ex.:
TYPE = MASTER
MAINFILE = "MainFile.ggd"
WHERE
  A : "AuxFile1.ggd" SIZE VAR(X,UNIFORM(0,5)+1)
  B : "AuxFile2.ggd" SIZE MAX(5-X,0)
In the previous example, each evaluation of the random variate UNIFORM(0,5)+1 is stored
and used for the subsequent evaluations of MAX(5-X,0). As the sequence issued from
MainFile.ggd is rewritten from left to right, this may imply that several rewritings
of symbol B use the same value for X, i.e. depend on the same rewriting of
A.
Figure 6.1: Dependencies during the generation stage
It can be seen in figure 6.1 that a variable's assignment can either be used by several subsequent
dependent variable assignments or be ignored. Therefore, this feature should only be used when the sequences
issued from MainFile.ggd are sufficiently constrained, for instance by exhibiting an assignment/use alternation.
This section describes the syntax and semantics of MASTER description files.
Figure 6.2: Main structure of a MASTER description file
Clauses nested inside square brackets are optional. The given order for the clauses is mandatory.
A hierarchical model must define the way symbols from the main sequence are to be expanded.
SYMBOLS = {WORDS,LETTERS}
Optional, defaults to SYMBOLS = WORDS
Chooses the type of symbols to be used for random generation.
When WORDS is selected, each pair of subsequent symbols is separated with spaces during the
generation.
MAINFILE = "
"
Required
Defines the main random generation model description file. The file must be accessible by appending the
string
to the current directory.
A so-called master sequence is generated from this description file for further rewritings.
Its length equals to the size provided to GenRGenS.
Remark: The overall size of the sequence generated from a MASTER model is not related
to the size parameter provided to GenRGenS.
WHERE =
  s1 : "File1.ggd" SIZE = d1
  s2 : "File2.ggd" SIZE = d2
  ...
Required
Defines the way to expand each occurrence of a symbol si
from the main file into a sequence issued from the corresponding description file,
having a size given by the distribution di.
Each si is a symbol that must belong to the vocabulary of the master file, but not necessarily to the
sequence generated from it.
Each file path must point to a description file that is read-access enabled.
Each di is a description of the random value associated with the size of this symbol's expansion.
A size can either be a constant, one of the predefined random variate generators,
or an arithmetic expression defined recursively from the following elements:
- An integer or float constant: generates a constant sequence size. If a non-integer size is specified at top level,
the value is rounded to the closest integer value, as it would have been using round.
- +, -, *, /, ^: classical binary arithmetic operators, acting on real numbers
and returning a real value.
- ( e ): an expression nested inside parentheses is evaluated separately from, and prior to, its context.
It can be used to overrule the usual operator precedences (e.g. (2+3)*4 = 20 whereas 2+3*4 = 14).
- pow(x, y): functional form of the power operator ^, returns x raised to the power y. If the first argument is
omitted, the natural base e is used.
- log(b, x): returns the logarithm of x in base b. If b
is omitted, returns the natural logarithm of x.
- floor(x): returns the closest integer value smaller than x.
- round(x): returns the closest integer to x.
- ceil(x): returns the closest integer value greater than x.
- min(x, y): returns the smallest number among {x, y}.
- max(x, y): returns the greatest number among {x, y}.
- length: returns the length of the master sequence. Equivalent syntax: size or N.
- normal(m, s): generates a pseudorandom real value obeying a normal
law having mean m and standard deviation s. Equivalent syntax: gaussian(m, s).
- uniform(a, b): generates a pseudorandom integer value uniformly distributed over {a, ..., b}
(e.g. uniform(0,3) yields 0, 1, 2 or 3).
If a is omitted, it defaults to 0. Equivalent syntax: random(a, b).
- var(X, e): declares a variable X, whose value always equals the most recent evaluation of e.
Once declared, X can be used anywhere among the SIZE declaration parts of the description file. Each variable carries the value 0 as
long as it has not been assigned.
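For instance (file name and numeric values purely illustrative), a symbol A could be expanded into a sequence whose size is drawn from a normal law of mean 100 and standard deviation 15, but never smaller than 50:
A : "AuxFile.ggd" SIZE MAX(ROUND(NORMAL(100,15)),50)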
This example is inspired by a simple model for stem-loop ribosomal frameshift sites. A frameshifting RNA causes the ribosome to
slip over a specific sequence/structure motif and shift to another phase. A model inspired by the HIV-1 stem-loop
frameshifting site is detailed in figure 6.3.
Figure 6.3: A simple stem-loop frameshifting model. X stands for any base,
Y means A or U and Z is anything but C.
The Heptamer part can be modeled with a regular expression, similar to a mutation model. The Spacer will use a Markov model. The
Stem_Loop requires the full power of a context-free grammar. Lastly, the frameshift sites are drowned into an ocean of coding
RNA (symbol Fill), modelled again by a Markov model.
Figure 6.4: A MASTER description file for the frameshift model from figure 6.3
Figure 6.5: Auxiliary description files
On line 1, a MASTER generator is chosen.
On line 2, we use words as symbols to avoid confusion.
On line 3, we define the main description file, whose content is listed in figure 6.5.
On line 5, we ask the generator to rewrite each occurrence of the symbol Heptamer with a sequence
issued from heptamers.ggd, having size 7.
On line 6, we ask the generator to rewrite each occurrence of Spacer with a sequence
issued from spacers.ggd, using a size uniformly distributed over {5, ..., 8}. Note that
UNIFORM(5,8) is an equivalent syntax for 5 + UNIFORM(0,3). After each evaluation,
the resulting size is stored inside a variable X.
On line 7, we ask the generator to rewrite each occurrence of Stem_Loop with a sequence
issued from stem_loops.ggd, using a size uniformly distributed over a given interval. After each evaluation,
the resulting size is stored inside a variable Y.
On line 8, we ask the generator to rewrite each occurrence of Fill using a size that depends on
the previous assignments.
The somewhat cabalistic formula for the size of Fill's expansion is just a little trick to ensure that
the Heptamer motif always occurs on phase 0. That is, NORMAL(300,100)+X+Y+7 must be a multiple
of 3. So, by dividing by 3, rounding down to the closest smaller integer and then multiplying by 3, we get the closest
sum of the sizes of Heptamer, Spacer, Stem_Loop and Fill that is divisible by 3 without
a remainder. It then suffices to subtract 7 (Heptamer's size), X (Spacer's size) and
Y (Stem_Loop's size) to get an eligible size for Fill.
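Following this explanation, and assuming the auxiliary file is named fill.ggd (a hypothetical name, not copied from the original figure), the size expression for Fill would read something like:
Fill : "fill.ggd" SIZE FLOOR((NORMAL(300,100)+X+Y+7)/3)*3-7-X-Y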
[1] A. Rambaut and N. C. Grassly. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Journal of Computational Biology, (13):235-238, 1997.
[2] A. Bairoch. The PROSITE dictionary of sites and patterns in proteins, its current status. Nucleic Acids Research, 13(21):3097-3103, 1993.
[3] B.A. Butler. Sequence analysis using GCG. Methods Biochem. Anal., (39):74-97, 1998.
[4] E. Coward. Shufflet: shuffling sequences while conserving the k-let counts. Bioinformatics, (15):1058-1059, 1999.
[5] A. Denise, O. Roques, and M. Termier. Random generation of words of context-free languages according to the frequencies of letters. In D. Gardy and A. Mokkadem, editors, Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities, Trends in Mathematics, pages 113-125. Birkhäuser, 2000.
[6] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, July 1999.
[7] E. Gasteiger, A. Gattiker, C. Hoogland, I. Ivanyi, R.D. Appel, and A. Bairoch. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., (31):3784-3788, 2003.
[8] P. Flajolet, P. Zimmermann, and B. Van Cutsem. A calculus for the random generation of labelled combinatorial structures. Theoretical Computer Science, 132:1-35, 1994. A preliminary version is available in INRIA Research Report RR-1830.
[9] G. Reinert, S. Schbath, and M.S. Waterman. Probabilistic and statistical properties of words: an overview. Journal of Computational Biology, 1-2(7):1-46, 2000.
[10] M. Nebel. Combinatorial properties of RNA secondary structures. Journal of Computational Biology, 3(9):541-574, 2003.
[11] Y. Ponty. Etudes combinatoire et génération aléatoire des structures secondaires d'ARN. Master's thesis, Université Paris Sud, 2003. Mémoire de DEA.
[12] D.B. Searls. Formal language theory and biological macromolecules. Series in Discrete Mathematics and Theoretical Computer Science, 47:117-140, 1999.