SNK: Mining Sequential Nuggets of Knowledge
This applet has been implemented by Bastien Rance

You probably need to install Java

Please do not hesitate to contact us: bastien [dot] rance [at] lri [dot] fr


SNK

Jnlp version of SNK implemented by Bastien Rance - Java 6.0 might be required


SNK: User Manual

Contents

1  What is SNK?
2  SNK in a nutshell
3  SNK input
    3.1  Protein domain studies
    3.2  General use
4  SNK output
5  Advanced users, Parameters
6  Other Functionalities
    6.1  Basic Functionalities
    6.2  Visualizing SNK output

1  What is SNK?

Given a set of proteins and their domain architectures, SNK lets you discover dependencies between domain sequences and a specified target. These dependencies can be visualised using SNK-DeeVee.
If your research questions are: SNK may help you.

2  SNK in a nutshell

SNK is a tool that aims to discover dependencies between the descriptions of objects, given as sequences of items, and a specific target item. These dependencies were introduced in [Froidevaux et al., 2007] and are called Sequential Nuggets of Knowledge. They are expressed as weighted rules of the form:
descriptor_1, descriptor_2, ..., descriptor_k -> target (a, b).
As shown in [Rance et al., 2008], SNK can be very helpful in molecular biology. The tool is nevertheless general and can be applied to any sequential data, for example in web usage mining.
The underlying algorithm presented in [Rance et al., 2007] searches for Sequential Nuggets of Knowledge whose consequents belong to a predefined set of items (target items) and which satisfy user-specified support and interestingness measure (IM) thresholds.

3  SNK input

Dedicated functionalities are provided for the case where the sequential data are protein domain architectures (subsection 3.1). The general case is described in subsection 3.2.

3.1  Protein domain studies

Input data can be extracted from the Pfam data bank http://pfam.sanger.ac.uk/ [Bateman et al., 2006].
SNK uses Pfam "Domain organisation" data, where the different architectures available for a family are shown.
Example of Pfam output:
There are 137 sequences with the following architecture: MACPF CO8B_ONCMY [ oncorhynchus mykiss (rainbow trout) (salmo gairdneri)] complement component c8 beta chain precursor (complement component 8subunit beta) (587 residues)

Hide all sequences with this architecture. Show all sequences with this architecture.
Loading all sequences...
There are 20 sequences with the following architecture: TSP_1, Ldl_recept_a, MACPF, TSP_1 CO8A_HUMAN [ homo sapiens (human)] complement component c8 alpha chain precursor (complement component 8subunit alpha) (584 residues)
 
Hide all sequences with this architecture. Show all sequences with this architecture.

Copy and paste the Pfam output into the SNK text field (9 on the figure).

Now click on the "scan data" button (3) to translate the Pfam output into the SNK input format. Pfam shows the number of proteins that share the same architecture, and the scan process adds one sequence for each of these proteins. If you want each architecture to appear only once, check the "one sequence per architecture" check-box.
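The translation performed by "scan data" can be pictured with the following minimal sketch. It is not the applet's actual code: the function name scan_pfam, the one_per_architecture argument and the regular expression are assumptions based only on the two example entries shown above, and other Pfam wordings are not handled.

```python
import re

# Assumed shape of one Pfam "Domain organisation" entry, taken from the example:
#   "There are N sequences with the following architecture: D1, D2, ... PROTEIN_ID [ organism ] ..."
ENTRY = re.compile(
    r"There are (\d+) sequences with the following architecture: "
    r"(.+?) \S+ \["
)


def scan_pfam(text, one_per_architecture=False):
    """Return one space-separated domain line per protein (or per architecture)."""
    sequences = []
    for match in ENTRY.finditer(text):
        count = int(match.group(1))
        # Domains are comma-separated in the Pfam output; SNK items are space-separated.
        domains = [d.strip() for d in match.group(2).split(",")]
        line = " ".join(domains)
        copies = 1 if one_per_architecture else count
        sequences.extend([line] * copies)
    return sequences
```

With the default setting, an architecture reported for 20 sequences yields 20 identical lines; with one_per_architecture=True it yields a single line.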

Every sequence must be manually associated with one annotation (e.g. a function). This annotation is given on an additional line above the sequences to which it applies. This line begins with a double slash.
The annotation may be added at any point during the input process.
In the following example, the first two sequences are associated with target 1 and the other two with target 2. SNK will search for Sequential Nuggets of Knowledge associated with each target (//).

// Target1
Retrotrans_gag zf-CCHC RVP_2 RVT_1
rve Chromo Chromo Chromo_shadow
// Target2
rve Chromo
RVT_1 rve Chromo

The lines "// Target1" and "// Target2" have been added manually by a human expert, before or after the "scan data" step.

Once this step is done, adjust the parameters to your needs and start SNK (button 4).

3.2  General use

SNK generates association rules from an input set of sequences written in a specific format.
SNK reads one sequence per line. Each item of a sequence is a string of characters containing no spaces, and items are separated by spaces. Every sequence must be associated with one annotation (e.g. a function). This annotation is given on an additional line above the sequences to which it applies. This line begins with a double slash.
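To make the format concrete, here is a small sketch of a reader for it. The function name parse_snk_input is hypothetical, and the sketch assumes, as in the example of subsection 3.1, that an annotation line applies to every sequence that follows it until the next annotation line.

```python
def parse_snk_input(text):
    """Return a list of (target, items) pairs read from SNK-style input text."""
    records = []
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("//"):
            # Annotation line: remember the target for the sequences below it.
            target = line[2:].strip()
        elif target is None:
            raise ValueError("sequence found before any // annotation line")
        else:
            # Sequence line: items are separated by whitespace.
            records.append((target, line.split()))
    return records
```

Applied to the example of subsection 3.1, this yields four records, two labelled Target1 and two labelled Target2.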

4  SNK output

SNK returns:
* (Case of c-SNoKs) all the minimal sequential nuggets of knowledge with respect to the user's parameters, formalized as sequential association rules.
* (Case of s-SNoKs) all the sequential nuggets of knowledge smaller than a size threshold, with respect to the user's parameters, formalized as sequential association rules.
SNK output is sorted by IM value, then by support, then by left-hand-side length and finally in alphanumeric order.
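The ordering can be illustrated with a multi-key sort. This is only a sketch: the Rule record and its field names are hypothetical, and the directions (highest IM and support first, shortest left-hand side first) are assumptions, since the manual gives the order of the keys but not their direction.

```python
from typing import NamedTuple, Tuple


class Rule(NamedTuple):
    """Hypothetical record for one mined rule."""
    lhs: Tuple[str, ...]  # ordered descriptors of the left-hand side
    target: str
    support: float
    im: float             # value of the chosen interestingness measure


def sort_rules(rules):
    """Sort by IM, then support, then left-hand-side length, then text."""
    return sorted(
        rules,
        key=lambda r: (-r.im, -r.support, len(r.lhs), " ".join(r.lhs)),
    )
```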

5  Advanced users, Parameters

SNK allows the user to configure several parameters: the interestingness measure (IM), the support and IM thresholds, and the rule minimum size or maximal bound (5, 6, 7 and 8 on figure 1).
Interestingness measure (IM)
This measure defines the quality and strength of the association between a sequence and the target. Confidence is one of the most widely used interestingness measures. Several interestingness measures are offered because not all measures are equally good at capturing dependencies between facts, and no measure is best in all cases. The SNK applet offers a choice of ten standard measures. To select the measure that best suits the data, the user can examine the key properties given in [Tan et al., 2002]. By default, confidence is chosen.
Support threshold
This threshold applies to all rules. It specifies the minimum share of proteins in the database that must match the description given by a rule, that is, contain the domains of the left-hand side in the specified order together with the target item of the right-hand side. It is expressed as a proportion (between 0 and 1) of the total number of proteins in the database. The support threshold can be low, but some minimal support is required to avoid associations that involve too few proteins and result from noise.
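The support of a rule, as described here, can be sketched as follows. The function names are hypothetical, records are (target, items) pairs, and the sketch assumes that "in the order specified" allows other domains to occur between the domains of the left-hand side; the applet may apply a stricter definition.

```python
def occurs_in_order(pattern, sequence):
    """True if the items of pattern appear in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later items must be found after earlier ones.
    return all(item in remaining for item in pattern)


def support(lhs, target, records):
    """Proportion of all records that carry the target and contain lhs in order."""
    if not records:
        return 0.0
    hits = sum(
        1 for rec_target, items in records
        if rec_target == target and occurs_in_order(lhs, items)
    )
    return hits / len(records)
```

On the four example sequences of subsection 3.1, the rule "rve, Chromo -> Target2" matches two of the four proteins, giving a support of 0.5.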
IM threshold
This threshold defines the minimal quality required of a rule. The higher the threshold, the stronger the retained rules. For "Confidence", a value of 0.8 is usually considered high.
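For the default measure, the check against the IM threshold can be sketched as below. The sketch uses the standard definition of confidence for association rules (among the records matching the left-hand side, the fraction that carry the target). The function names, the (target, items) record layout and the assumption that gaps are allowed between domains are illustrative and not taken from the applet.

```python
def occurs_in_order(pattern, sequence):
    """True if the items of pattern appear in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def confidence(lhs, target, records):
    """Among records containing lhs in order, the fraction annotated with target."""
    matching_targets = [t for t, items in records if occurs_in_order(lhs, items)]
    if not matching_targets:
        return 0.0
    return matching_targets.count(target) / len(matching_targets)


def passes_im_threshold(lhs, target, records, threshold=0.8):
    """Keep a rule only if its confidence reaches the user-specified threshold."""
    return confidence(lhs, target, records) >= threshold
```

On the example of subsection 3.1, three sequences contain rve followed by Chromo and two of them are annotated Target2, so "rve, Chromo -> Target2" has a confidence of 2/3 and would be rejected at a threshold of 0.8.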
Rule minimum size (case of c-SNoKs)
The minimum size of the generated rules specifies the minimal number of domains expected in the left hand side of the rule. All the c-SNoKs of size greater than or equal to this size threshold will be generated.
Rule maximal bound (case of s-SNoKs)
The maximal bound specifies the maximal number of domains expected in the left hand side of the rule. All the s-SNoKs from length 1 to maximal bound will be generated.

6  Other Functionalities

6.1  Basic Functionalities

SNK offers two functionalities that are not directly linked to the SNK algorithm.

6.2  Visualizing SNK output

Using or analysing the rules mined by SNK is not always easy. We propose DeeVee as a solution to visualise and analyse SNK output. DeeVee is a simple protein domain viewer connected to SNK. DeeVee uses Pfam output to create its main window. To open the DeeVee main window, click on the "view proteins" button (11) in the SNK window. The user can click on the "Export rules" button (12) of SNK to open the DeeVee "Rules" window. This window lets the user select a rule and display the proteins whose domains occur in the same order as in the left-hand side of that rule.

References

[Bateman et al., 2006]
Finn,R.D., Mistry,J., Schuster-Böckler,B., Griffiths-Jones,S., Hollich,V., Lassmann,T., Moxon,S., Marshall,M., Khanna,A., Durbin,R., Eddy,S.R., Sonnhammer,E.L.L., Bateman,A. (2006) Pfam: clans, web tools and services, Nucleic Acids Research Database Issue 34:D247-D251.
[Froidevaux et al., 2007]
Froidevaux,C., Lisacek,F., Rance,B. (2007) Extracting Sequential Nuggets of Knowledge, Proc. of DEXA'07, LNCS 4653 740-750.
[Rance et al., 2007]
Rance,B., Lisacek,F., Froidevaux,C. (2007) An algorithm for Mining Minimal Sequential Nuggets of Knowledge. LRI Technical Report 1476, October 2007.
[Rance et al., 2008]
Rance,B., Lisacek,F., Froidevaux,C. (2008) SNK: a new method for mining sequential nuggets of knowledge from protein families, under submission.
[Tan et al., 2002]
Tan,P.N., Kumar,V., Srivastava,J. (2002) Selecting the Right Interestingness Measure for Association Patterns, SIGKDD'02



File translated from TEX by TTH, version 3.80.
On 8 Apr 2008, 10:46.

Appendix

See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html for exact regular expression syntax.

Laboratoire de Recherche en Informatique - Equipe de Bioinformatique