- Some theory
- Regular expressions syntax
- PROSITE Patterns
- PROSITE patterns/Rational expressions relationship
- Uniform random generation among the language denoted by a Regular Expression

- Implementing a Rational/PROSITE model

- Examples

- Rational expressions-specific command line options

GenRGenS provides a random generation process for both PROSITE patterns and regular expressions. For instance, random sequences drawn with respect to a PROSITE pattern that is supposed to be a fingerprint for a biological property can be used to test its relevance. One can also generate simple mutants from a regular expression by introducing some

Regular expressions syntax

The empty-word symbol is a shortcut for the empty word (a word of size 0).

Formally, a language can be seen as a set of words over a given alphabet. The meanings of these alternatives are related to the languages denoted by such expressions.

- The disjunction: the language is the union of the languages associated with the two operand expressions. Any word of the language belongs to one or the other.
- The concatenation: the language is the concatenation of the languages associated with the two operand expressions. Any word of the language can be decomposed into a word from the first language followed by a word from the second.
- The iteration: each word of the resulting language is a concatenation of a finite sequence of words from the language of the iterated expression.
- The parenthesized expression: the language is that of the enclosed expression. This construction is useful to avoid ambiguity. For instance, an expression combining several operators can denote two different languages, depending on how its operands are grouped.
- The single character: the only word of the language is that character.
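The semantics of these constructions can be sketched in a few lines of Python. This is only an illustration, not GenRGenS code: the tuple-based expression tree and the function `words` are hypothetical names introduced here, and the language is truncated to words of bounded length so that the iteration stays finite.

```python
from itertools import product

# Hypothetical expression tree (not the GenRGenS input syntax):
#   ("eps",)          the empty word
#   ("char", "a")     a single character
#   ("or", e1, e2)    disjunction
#   ("cat", e1, e2)   concatenation
#   ("star", e)       iteration


def words(expr, max_len):
    """Return the set of words of length <= max_len denoted by expr."""
    kind = expr[0]
    if kind == "eps":
        return {""}
    if kind == "char":
        return {expr[1]} if max_len >= 1 else set()
    if kind == "or":
        # Union of the two operand languages.
        return words(expr[1], max_len) | words(expr[2], max_len)
    if kind == "cat":
        # Every word is a word of the left language followed by one of the right.
        left = words(expr[1], max_len)
        right = words(expr[2], max_len)
        return {u + v for u, v in product(left, right) if len(u + v) <= max_len}
    if kind == "star":
        # Concatenations of finitely many words; the empty word is always included.
        base = words(expr[1], max_len) - {""}
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {
                u + v for u in frontier for v in base if len(u + v) <= max_len
            } - result
            result |= frontier
        return result
    raise ValueError(f"unknown construction: {kind!r}")


a, b, c = ("char", "a"), ("char", "b"), ("char", "c")
grouped_left = ("or", ("cat", a, b), c)    # (a then b) or c
grouped_right = ("cat", a, ("or", b, c))   # a then (b or c)
print(sorted(words(grouped_left, 3)))      # ['ab', 'c']
print(sorted(words(grouped_right, 3)))     # ['ab', 'ac']
```

The last two lines show why grouping matters: the same three characters and two operators denote different languages depending on where the parentheses are placed.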

Such expressions are also well suited to modelling mutants. Suppose you are given three 5.8S ribosomal RNA sequences that are close to one another and aligned as follows.

... | C | G | C | C | C | C | G | C | C | G | G | C | G | G | ... |

... | A | C | G | C | G | A | C | C | C | G | G | U | G | G | ... |

... | C | C | U | G | U | U | . | G | U | G | G | U | G | G | ... |
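One way to turn such an alignment into an expression is to take, column by column, the disjunction of the letters observed. The sketch below is an assumption about how this could be done, not the construction used by GenRGenS: it treats `.` as an alignment gap, drops the leading and trailing `...` of the excerpt, and emits Python `re` syntax rather than the GenRGenS expression syntax. The function name `alignment_to_regex` is hypothetical.

```python
import re

# The three aligned excerpts from the table above ("." is assumed to be a gap).
ROWS = [
    "CGCCCCGCCGGCGG",
    "ACGCGACCCGGUGG",
    "CCUGUU.GUGGUGG",
]


def alignment_to_regex(rows, gap="."):
    """Build a Python regular expression accepting every column-wise mix of rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    parts = []
    for column in zip(*rows):
        # Distinct letters of the column, in order of first appearance.
        letters = []
        for symbol in column:
            if symbol != gap and symbol not in letters:
                letters.append(symbol)
        if not letters:
            continue  # a column made only of gaps contributes nothing
        part = letters[0] if len(letters) == 1 else "[" + "".join(letters) + "]"
        if gap in column:
            part += "?"  # a gap makes the position optional
        parts.append(part)
    return "".join(parts)


pattern = alignment_to_regex(ROWS)
# Each original sequence (gaps removed) belongs to the resulting language.
for row in ROWS:
    assert re.fullmatch(pattern, row.replace(".", "")) is not None
print(pattern)
```

The resulting language contains the three input sequences together with every sequence obtained by mixing their letters position by position, which is a simple notion of mutant.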

Further information can be found at:

A PROSITE pattern can then be recursively defined as follows:

- The concatenation: protein codes are derived from the two sub-patterns and concatenated.
- The strict iteration: a protein code is derived from the sub-pattern and copied a fixed number of times.
- The loose iteration: a protein code is derived from the sub-pattern and copied a number of times lying between a lower and an upper bound.
- The identity: this notation is equivalent to the pattern itself, and simply means that a sequence is derived from it. It can be used to resolve some ambiguities.
- The inclusive disjunction: any amino-acid code can be chosen from the given list.
- The exclusive disjunction: any amino-acid code that is not in the given list can be chosen.
- The wildcard: any amino acid, including the unknown code.
- The amino-acid sequence: a word composed of amino-acid codes.
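The constructs listed above can be illustrated by a small sampler. This is a minimal sketch under several assumptions, not the GenRGenS parser: it uses the standard PROSITE notation (`-` between elements, `x` for the wildcard, `[...]` and `{...}` for the two disjunctions, `(n)` and `(n,m)` for the iterations), restricts iteration to single elements, draws each repetition independently, leaves out the unknown code and the `<`/`>` anchors, and does not draw uniformly over the language. The names `parse` and `sample` and the example pattern are hypothetical.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

# One pattern element: wildcard, single code, [list] or {list},
# optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def parse(pattern):
    """Split a pattern into (allowed codes, min repeats, max repeats) triples."""
    elements = []
    for token in pattern.split("-"):
        match = ELEMENT.fullmatch(token)
        if match is None:
            raise ValueError(f"unsupported pattern element: {token!r}")
        core, low, high = match.groups()
        if core == "x":                      # wildcard
            choices = AMINO_ACIDS
        elif core.startswith("["):           # inclusive disjunction
            choices = core[1:-1]
        elif core.startswith("{"):           # exclusive disjunction
            choices = "".join(c for c in AMINO_ACIDS if c not in core[1:-1])
        else:                                # single amino-acid code
            choices = core
        low = int(low) if low else 1         # no suffix means exactly once
        high = int(high) if high else low    # (n) is the strict iteration
        if high < low:
            raise ValueError(f"empty repetition range in {token!r}")
        elements.append((choices, low, high))
    return elements


def sample(elements, rng=None):
    """Draw one sequence compatible with the parsed pattern (not uniformly)."""
    rng = rng or random.Random()
    out = []
    for choices, low, high in elements:
        for _ in range(rng.randint(low, high)):
            out.append(rng.choice(choices))
    return "".join(out)


example = parse("C-x(2,4)-[LIVM]-{P}-H(2)")  # illustrative pattern only
print(sample(example, random.Random(0)))
```

Choosing the repetition count first and the letters afterwards favours short sequences; a uniform draw over the finite language would instead weight each count by the number of sequences it yields.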

For instance, consider the ORFs inside of DNA. They start with a START codon

Controlled non-uniform random generation can also be achieved using the same distributions as within the

Chooses between rational/regular expression and PROSITE pattern syntaxes for the expression.

The rational or

Defines the weights of the terminal letters.

As discussed in section 4.3 for the

On line

On line

On lines

On line

On line

**-i [TF]**: When **-i T** is selected, the *size* parameter is ignored, so that the sequences are drawn at random from the finite set corresponding to the `PROSITE` pattern defined in the file *PrositeGGDFile*. Otherwise, sequences of the given *size* are generated.

**Defaults to -i F.**