next up previous contents
Next: The MASTER package : Up: GenRGenS v2.0 User Manual Previous: The GRAMMAR package: Context-Free   Contents


The RATIONAL package : Regular Expressions and PROSITE patterns

At the bottom of Chomsky's formal languages hierarchy lies the regular expressions. Taken as languages theory tools, their main purpose is the description of subsets over the whole set of sequences that can be parsed and/or generated by very simple machines called finite state automata over finite alphabets. This class is known to be a subclass of the one that is implemented by the context-free grammar, but more efficient random generation algorithms are available for it. Because of their simplicity, they have been extensively used to model families of genomic sequences differing only on a few positions. In the Prosite database, which is well known database of protein families and domains, a pattern (or consensus) is built from a given set of sequences, sharing functional properties. These so-called Prosite patterns are related to regular expressions, as shown in 5.1.3.
GenRGenS provides a random generation process both for Prosite patterns and regular expressions. For instance, random sequences drawn with respect to a Prosite pattern supposedly being a fingerprint for a biological property can be used to test its relevance. One can also generate simple mutants from a regular expression by introducing some choice places inside the sequence. Such sequences can used for computation of statistical scores, such as Z-Scores, and P-values.

Some theory

In this chapter, we will describe the syntax and semantics of the regular expressions. We will then explain how GenRGenS turns a ProSite pattern into a regular expression associated with the same language, and show how uniform random generation can be performed from a regular expression.

Regular expressions syntax

A regular expression $ e$ is a language description tool, recursively defined as follows :

$\displaystyle e = \left\{ \begin{array}{c} e'^{\texttt{*}} \\ e'\ \texttt{\vert...
...} \ e'' \\ \texttt{(}e'\texttt{)} \\ \omega \\ \varepsilon \end{array} \right. $

Where $ e_1$ and $ e_2$ are regular expressions, and $ l$ stands for any letter among the alphabet.
The $ \varepsilon$ is a shortcut for an empty word(a word of size 0).

Formally, a language can be seen as a set of words over a given alphabet. The meanings of these alternatives are related to the languages denoted by such expressions.
The disjonction $ e'\ \vert\ e''$: The language is the union of the languages associated to $ e'$ and $ e''$. Any word of $ e'\ \vert\ e''$ belongs to $ e'$ or $ e''$.
The concatenation $ e'\ .\ e''$: The language is the concatenation of the languages associated to $ e'$ and $ e''$. Any word among $ e'\ .\ e''$ can be decomposed into a concatenation of words issued from $ e'$ and $ e''$.
The iteration $ e^*$: Each word of the resulting language is a concatenation of a finite set of words issued from $ e$.
$ (e')$: The language is that of $ e'$. This construction is useful to avoid ambiguity. For instance, the expression $ a\vert b.c$ can denote the language $ \{a,bc\}$ or the language $ \{ac,bc\}$.
$ \omega,\varepsilon$: The only word among the language is the single character $ l$, resp. $ \varepsilon$.
Example 1:

$\displaystyle e=(A.T.G).((A\vert T\vert G\vert C).(A\vert T\vert G\vert C).(A\vert T\vert G\vert C))^*.(T.A.G\vert T.G.A\vert T.A.A)$

The regular expression $ e$ describes a simplistic model for an ORF, that is a subsequence of a DNA code starting with a START base triplet $ ATG$ and ending with one of the STOP base triplets TAG/TGA/TAA. It should be noticed that anything can happen between the START and STOP codon, as the $ ((A\vert T\vert G\vert C).(A\vert T\vert G\vert C).(A\vert T\vert G\vert C))^*$ part of $ e$ doesn't excludes STOP base triplets.

Example 2:
Such expressions are also perfectly fit to model mutants. Suppose you're given three 5.8S ribosomal RNA sequences close one to another and aligned as follows.
... C G C C C C G C C G G C G G ...
... A C G C G A C C C G G U G G ...
... C C U G U U . G U G G U G G ...
A regular expression $ e$ can then be used to model a family of sequences that contains the three original sequences, among with some other sequences close to the originals.

$\displaystyle e=\cdots(C\vert A).(C\vert G).(C\vert G\vert U).(C\vert G).(C\ver...
...t U).(G\vert C\vert\varepsilon).(C\vert G).(C\vert U).G.G.(C\vert U).G.G\cdots $

PROSITE Patterns

What are PROSITE Patterns ?

PROSITE is a database of protein families and domains.It consists of biologically significant sites, patterns and profiles that help to identify to which known protein family a new sequence belongs. It was started in 1989 by Amos Bairoch, and is part of the SWISS-PROT program[2]. It features text-files structured by tags and carriage-returns. Among these tags, some are used to define patterns to model sequential aspects behind functional properties. They can be seen as consensus, as they capture some sequential similarities of a given entry set of sequences.
Further informations can be found at:


Let $ \mathcal{P}$ be the set of all known amino-acids, abbreviated with respect to the standard one-letter IUPAC code5.1 given by figure
Figure 5.1: Standard codes for the amino-acids, as proposed by the IUPAC
\begin{tabular}{\vert c\vert c\vert} \hline
...Z \\
Unknown & X \\ \hline
Then a PROSITE pattern $ p$ can then be recursively defined as follows:

$\displaystyle p = \left\{ \begin{array}{c}
\texttt{x} \\
l \end{array} \right. $

where $ p'$ and $ p''$ are PROSITE patterns; $ n$, $ n_{min}$ and $ n_{max}$ are positive integers; $ l$ is a sequence of amino-acids encoded with respect to the IUPAC code for the amino-acids. The semantics for the preceeding alternatives is described below :
The concatenation $ p'\texttt{-}p''$: protein codes are derived from patterns $ p'$ and $ p''$ and concatenated.
The strict iteration $ p'\texttt{(}n\texttt{)}$: a protein code is derived from pattern $ p'$ and copied $ n$ times.
The loose iteration $ p'\texttt{(}n_{min}\texttt{,}n_{max}\texttt{)}$: a protein code is derived from pattern $ p'$ and copied from $ n_{min}$ to $ n_{max}$ times.
The identity $ \texttt{(}p'\texttt{)}$: This notation is equivalent to $ p'$, and simply means that a sequence is issued from $ p'$. It can be used to resolve some ambiguities.
The inclusive disjonction $ \texttt{[}l\texttt{]}$: Any amino-acid code can be choosed from the list $ l$.
The exclusive disjonction $ \texttt{\{}l\texttt{\}}$: Any amino-acid code that is not in the list $ l$ can be choosed.
The wildcard $ \texttt{x}$: Any amino-acid, including the unknown $ X$ code.
The amino-acid sequence $ l$: $ l$ is a word composed of amino-acid codes.

PROSITE patterns/Rational expressions relationship

As the rational expressions are also defined using concatenations, iterations and disjunctions, a PROSITE pattern can be considered as a rational expression. The correspondance between PROSITE elementary operations and rational ones are listed below:

$\displaystyle \begin{array}{rcl}
p'\texttt{-}p'' & \Rightarrow& e(p')\texttt{...
...k_i \in \mbox{IUPACCodes} \bigcup \{X\}\\
l & \Rightarrow & l

Uniform random generation among the language denoted by a Regular Expression

Its is a classical result of the formal language theory that any regular expression can be turned into an automaton that recognises/generates the same language. Such an automaton can be turned into a deterministic one, that is an automaton where each accepted word is uniquely coupled with a walk between two special states (start and final). Thus, instead of expanding a initial non-terminal symbol the way we did for the GRAMMAR package, we are going to walk inside of this automaton with carefuly chosed probabilities.

For instance, consider the ORFs inside of DNA. They start with a START codon ATG, then follows a region of codons that are not STOP-ones, and finally they end with the STOP codon TAA. This can be modelled by the following expression, whose deterministic automaton is shown in figure 5.2.

$\displaystyle (ATG).( T.A.(C\vert G\vert T) \vert T.(C\vert G\vert T).(A\vert C...
...rt (C\vert G\vert T).(A\vert C\vert G\vert T).(A\vert C\vert G\vert T))^*.(TAA)$

Figure 5.2: Deterministic automaton corresponding to the simple ORF model.
Image autORF
Each sequence is bijectively associated with a single path in the resulting. For instance, the sequence ATG AAT TCG GAT TAA corresponds to the state sequence $ 1$-$ 2$-$ 3$ $ 7$-$ 8$-$ 9$ $ 5$-$ 8$-$ 9$ $ 7$-$ 8$-$ 9$ $ 5$-$ 6$-$ 10$. As explained in the GRAMMAR package, the choice of the next state must be made using correct probabilities in order to achieve uniformity. These probabilities are proportional to the amount of words reachable from the destination state. Once again these numbers can be computed using simple linear recurrences before the generation stage.

Controlled non-uniform random generation can also be achieved using the same distributions as within the GRAMMAR package. Indeed, each symbol(letter, IUPAC Code) can be associated with a weight inside the WEIGHT clause, so that a sequence's probability will the product of all its letters' weights, normalized by the sum of all the weights over the sequence set denoted by the expression. This weight mechanism may be used for instance to increase the proportion of a given base as well as to favor the occurence of a given structured motif.

Implementing a Rational/PROSITE model

Main structure

The main structure of the file is shown in figure 5.3
Figure 5.3: Main structure of a RATIONAL description file
.....\\ [0.1cm]
[ALIASES = ...]

RATIONAL generation specific clauses

The LANGUAGE clause

Chooses between rational/regular expression and PROSITE pattern syntaxes for the expression.


The rational or PROSITE expression. The syntax for $ e$ depends on the value of the LANGUAGE clause, and has always been described in section 5.1.1 for a RATIONAL choice, and in section for PROSITE patterns.

The WEIGHTS clause

WEIGHTS = $ l_1$ $ w_1$ $ l_2$ $ w_2$ ...
$ w_i\in\mathbb{R}$
Optional, all weights default to $ 1$.
Defines the weights of the terminal letters $ l_i$.
As discussed in section 4.3 for the GRAMMAR description file, assigning a weight $ w_i$ to a terminal letter $ l_i$ is a way to gain control over the average frequency of $ l_i$.


RATIONAL style example


We show and explain the GenRGenS description file corresponding to the simple ORF model shown in figure 5.2.
\texttt{\begin{tabular}{\vert c\vert l\vert} \...
...\bf 7} &\quad C'=C G'=G \\ \hline


On line 1, we choose the RATIONAL package for random generation.

On line 2, we select a rational/regular expression.

On line 3, the rational expression corresponding to the simple ORF model from figure 5.2. Notice that the star operator * has highest precedence, so that "A|B*" $ \Leftrightarrow$"A|(B*)"$ \neq$"(A|B)*". On lines 4 and 5, weights are assigned to the C's and G's. A weight higher than $ 1$ for a symbol will increase its number of occurence within generated sequences. Here, we choose to favor C and G among the coding area.

On lines 6 and 7, the symbols C' and G' were only introduced to allow the assignment of weights within the coding area. Indeed, we needed the weight not to affect the G from the START codon. We don't really need them anymore, so we send them back to their original representations C and G.

PROSITE style example

We illustrate the use of GenRGenS' RATIONAL package to generate protein sequences with respect to a given PROSITE pattern by a real-life example. The pattern CBM1_1 can be found under accession code PS00562 on expasy's and is considered as a signature for the carbohydrate binding type-1 domain. Figure 5.4 shows its entry in the PROSITE database.
Figure 5.4: PROSITE entry for pattern CBM1_1
\scriptsize {
ID CBM1_1; PAT...
... 1AZK; 1CBH; 2CBH;
DO PDOC00486;\end{verbatim}


The PA line of the PROSITE file can be inserted directly into the EXPRESSION clause, without the terminal dot, resulting in the following description file:
\texttt{\begin{tabular}{\vert c\vert l\vert} \...
...[NHGS]-x-[FYWM]-x(2)-Q-C\\ \hline


On line 1, we choose the RATIONAL package for random generation.

On line 2, we select a PROSITE pattern expression.

On line 3 and 4, the PROSITE pattern.

Rational expressions-specific command line options

The default behaviour of GenRGenS is to generate sequences of the size provided through the -size command line option. However, it can also be useful to generate among every words possibly drawn from a PROSITE expression, regardless of the size. This is the purpose of the -i option. The sequences are still drawn at random.

Rational specific command line options usage:
java -cp . GenRGenS.GenRGenS -size n -nb k -i [T$ \mid$F] PrositeGGDFile

next up previous contents
Next: The MASTER package : Up: GenRGenS v2.0 User Manual Previous: The GRAMMAR package: Context-Free   Contents
Yann Ponty 2007-04-19