JNLP version of SNK, implemented by Bastien Rance (Java 6.0 might be required)
Using Applet
- To use the demo, you have to accept our security certificate.
- If you cannot see the "Demo button", you probably need to download and install the Sun Java Virtual Machine.
SNK: User Manual
Contents
1 What is SNK?
2 SNK in a nutshell
3 SNK input
3.1 Protein domain studies
3.2 General use
4 SNK output
5 Advanced users, Parameters
6 Other Functionalities
6.1 Basic Functionalities
6.2 Visualizing SNK output
1 What is SNK?
Given a set of proteins and their domain architectures, SNK allows you to discover dependencies between sequences of domains and a specified target. These dependencies can be visualised using SNK-DeeVee.
If your research questions are:
- Is there a specific domain signature for the proteins of a given family?
- Is there a dependency between the sequence Domain1, Domain2 (in this order) and family1?
SNK may help you.
2 SNK in a nutshell
SNK is a tool that aims to discover dependencies between the descriptions of objects, given as sequences of items, and some specific target item. These dependencies were introduced in [Froidevaux et al., 2007] and are called Sequential Nuggets of Knowledge. They are expressed by weighted rules as follows:
descriptor1, descriptor2, ..., descriptork -> target (a, b).
- The first value a is a support value that specifies how many objects of the database have in their description: descriptor1 followed by descriptor2, ... followed by descriptork (descriptors have to be ordered this way, but do not need to be strictly contiguous in the full description of the object).
- The second value b is an interestingness measure (IM) value that determines the relevance of the rule.
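As a minimal sketch of how the two values of a rule could be computed, the Python code below tests whether the descriptors of a left hand side occur in order (not necessarily contiguously) in a description, then derives a support count and a confidence value. The function names and the toy database are hypothetical and are not part of SNK. The sketch assumes that support counts the objects that match the left hand side and carry the target, and that confidence is the chosen interestingness measure.

```python
def contains_in_order(description, pattern):
    """True if the items of `pattern` occur in `description` in the same
    order, possibly with other items in between."""
    items = iter(description)
    # `item in items` consumes the iterator up to the first match,
    # so each descriptor is searched after the previous one.
    return all(item in items for item in pattern)


def support_and_confidence(database, pattern, target):
    """Return (a, b) for the rule `pattern -> target`.

    a: number of objects matching `pattern` and annotated with `target`
    b: confidence, i.e. a divided by the number of objects matching `pattern`
    """
    matching = [t for description, t in database
                if contains_in_order(description, pattern)]
    support = sum(1 for t in matching if t == target)
    confidence = support / len(matching) if matching else 0.0
    return support, confidence


# Hypothetical toy database: (description, target) pairs.
database = [
    (["Retrotrans_gag", "zf-CCHC", "RVP_2", "RVT_1"], "Target1"),
    (["rve", "Chromo", "Chromo", "Chromo_shadow"], "Target1"),
    (["rve", "Chromo"], "Target2"),
    (["RVT_1", "rve", "Chromo"], "Target2"),
]

a, b = support_and_confidence(database, ["rve", "Chromo"], "Target2")
print(f"rve, Chromo -> Target2 ({a}, {b:.2f})")
```

In this toy database three descriptions contain rve followed by Chromo, and two of them are annotated with Target2.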
As shown in [Rance et al., 2008], SNK can be very helpful in molecular biology. The tool is general enough to be used more broadly for the analysis of any sequential data, for example in web usage mining.
The underlying algorithm presented in [Rance et al., 2007] searches for Sequential Nuggets of Knowledge whose consequents belong to some predefined set of items (target items) and satisfy user-specified support and IM thresholds.
- Support can be low if the user is interested in discovering close dependencies between facts that almost co-occur, even if these facts are not frequent. However, a minimal support is required to avoid discovering strong associations that involve only a few objects and may come from noise.
- Several interestingness measures (IM) are considered, since not all measures are equally good at capturing dependencies between facts and no measure is better than the others in all cases. Rules are expected to be strongly relevant, so the IM threshold should be high.
- Due to the huge number of rules that could be generated, not all Sequential Nuggets of Knowledge are searched for.
* (c-SNoKs) SNK can calculate all the shortest rules of length greater than or equal to some user-defined minimal size k. This means that any rule R yielded by SNK has at least k descriptors in its left hand side, and that there is no other rule with at least k descriptors whose set of descriptors is strictly included in that of R. A minimal length can be required since rules that are too short are often not significant. If k is set to 1, the sequential nuggets of knowledge provided by SNK correspond to concise characteristics of the objects with respect to the given target.
* (s-SNoKs) SNK can also generate all the sequential nuggets of length less than or equal to some user-defined maximal bound. They correspond to common characteristics of the objects in relation to some target. If the maximal bound is set too high, the tool may become less efficient.
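The two kinds of output described above can be illustrated with a naive brute-force sketch. This is not the algorithm of [Rance et al., 2007]: it simply enumerates every ordered sub-sequence of the descriptions up to a maximal bound, keeps those that meet the thresholds (an s-SNoK-like result), and then optionally filters out the rules whose descriptor set strictly contains that of another rule (a c-SNoK-like result). All names are hypothetical. Support is assumed to be a proportion of the database and confidence is assumed to be the interestingness measure.

```python
from itertools import combinations


def contains_in_order(description, pattern):
    """True if `pattern` occurs in `description` in order, gaps allowed."""
    items = iter(description)
    return all(item in items for item in pattern)


def candidate_patterns(database, max_len):
    """All distinct ordered sub-sequences of length 1..max_len."""
    patterns = set()
    for description, _ in database:
        for length in range(1, min(max_len, len(description)) + 1):
            for positions in combinations(range(len(description)), length):
                patterns.add(tuple(description[p] for p in positions))
    return patterns


def mine_rules(database, target, min_support, min_confidence, max_len):
    """s-SNoK-like result: all rules up to `max_len` meeting both thresholds."""
    rules = []
    for pattern in sorted(candidate_patterns(database, max_len)):
        matching = [t for d, t in database if contains_in_order(d, pattern)]
        hits = sum(1 for t in matching if t == target)
        support = hits / len(database)
        confidence = hits / len(matching)  # every candidate matches >= 1 object
        if support >= min_support and confidence >= min_confidence:
            rules.append((pattern, support, confidence))
    return rules


def keep_shortest(rules, min_len):
    """c-SNoK-like filter: rules with at least `min_len` descriptors whose
    descriptor set does not strictly contain that of another such rule."""
    eligible = [r for r in rules if len(r[0]) >= min_len]
    return [r for r in eligible
            if not any(set(other[0]) < set(r[0]) for other in eligible)]


# Hypothetical toy database: (description, target) pairs.
database = [
    (["Retrotrans_gag", "zf-CCHC", "RVP_2", "RVT_1"], "Target1"),
    (["rve", "Chromo", "Chromo", "Chromo_shadow"], "Target1"),
    (["rve", "Chromo"], "Target2"),
    (["RVT_1", "rve", "Chromo"], "Target2"),
]

rules = mine_rules(database, "Target2", min_support=0.5,
                   min_confidence=0.6, max_len=2)
for pattern, support, confidence in keep_shortest(rules, min_len=1):
    print(", ".join(pattern), "-> Target2", (support, round(confidence, 2)))
```

The number of candidates grows quickly with the maximal bound, which is one way to see why a bound that is too high reduces efficiency.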
3 SNK input
Dedicated functionalities are provided for the case where the sequential data are protein domain architectures (next subsection). The general case is described in subsection 3.2.
3.1 Protein domain studies
Input data can be extracted from the Pfam data bank http://pfam.sanger.ac.uk/ [Bateman et al., 2006].
SNK uses Pfam "Domain organisation" data, where the different architectures available for a family are shown.
Example of Pfam output:
There are 137 sequences with the following architecture: MACPF CO8B_ONCMY [ oncorhynchus mykiss (rainbow trout) (salmo gairdneri)] complement component c8 beta chain precursor (complement component 8subunit beta) (587 residues)
Hide all sequences with this architecture. Show all sequences with this architecture.
Loading all sequences...
There are 20 sequences with the following architecture: TSP_1, Ldl_recept_a, MACPF, TSP_1 CO8A_HUMAN [ homo sapiens (human)] complement component c8 alpha chain precursor (complement component 8subunit alpha) (584 residues)
Hide all sequences with this architecture. Show all sequences with this architecture.
Copy and paste the Pfam output into the SNK text field (9 on the figure).
Now click on the "scan data" button (3) to translate Pfam output into SNK entry format. Pfam shows the number of proteins that have the same architecture. The scan process will add a sequence for each protein. If you want the architecture to appear only once, please check the "one sequence per architecture" check-box.
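The translation performed by the "scan data" step can be sketched as follows. This is only an illustration under stated assumptions, not the code of SNK: the regular expression is inferred from the two example lines above (a count, a comma-separated architecture, then a protein identifier followed by an opening bracket), and the function name and the `one_sequence_per_architecture` flag are hypothetical names mirroring the check-box.

```python
import re

# Assumed shape of a Pfam "Domain organisation" line, inferred from the example.
ARCHITECTURE_LINE = re.compile(
    r"There (?:are|is) (\d+) sequences? with the following architecture: "
    r"(.+?) (\S+) \["
)


def scan_pfam_output(text, one_sequence_per_architecture=False):
    """Translate pasted Pfam output into one space-separated sequence per line."""
    sequences = []
    for line in text.splitlines():
        match = ARCHITECTURE_LINE.match(line.strip())
        if match is None:
            continue  # skip lines such as "Loading all sequences..."
        count = int(match.group(1))
        domains = [d.strip() for d in match.group(2).split(",")]
        sequence = " ".join(domains)
        repeats = 1 if one_sequence_per_architecture else count
        sequences.extend([sequence] * repeats)
    return sequences


pfam_text = """\
There are 137 sequences with the following architecture: MACPF CO8B_ONCMY [ oncorhynchus mykiss (rainbow trout) (salmo gairdneri)] complement component c8 beta chain precursor (587 residues)
Hide all sequences with this architecture. Show all sequences with this architecture.
Loading all sequences...
There are 20 sequences with the following architecture: TSP_1, Ldl_recept_a, MACPF, TSP_1 CO8A_HUMAN [ homo sapiens (human)] complement component c8 alpha chain precursor (584 residues)
"""

for sequence in scan_pfam_output(pfam_text, one_sequence_per_architecture=True):
    print(sequence)
```

Without the flag, the same input yields one line per protein, so each architecture is repeated as many times as Pfam reports sequences for it.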
Every sequence must be manually associated with one annotation (e.g. function). This annotation is given on an additional line above the sequence to which it applies. This line begins with a double slash.
The annotation may be added at any moment of the input process.
In the following example, the first two sequences are associated with target 1 and the other two with target 2. SNK will search for Sequential Nuggets of Knowledge associated with each target (//).
// Target1
Retrotrans_gag zf-CCHC RVP_2 RVT_1
rve Chromo Chromo Chromo_shadow
// Target2
rve Chromo
RVT_1 rve Chromo
The lines "// Target1" and "// Target2" have been added manually by a human expert, before or after the "scan data" step.
When this step has been done, adjust the parameters to your own needs and start SNK (button 4).
3.2 General use
SNK generates association rules from an input set of sequences in a specific format.
SNK reads one sequence per line. An item of a sequence is a string of characters without any space, and items are separated by spaces. Every sequence must be associated with one annotation (e.g. function). This annotation is given on an additional line above the sequence to which it applies. This line begins with a double slash.
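A minimal sketch of a reader for this format is given below. It assumes, as in the example of subsection 3.1, that an annotation line applies to every sequence that follows it until the next annotation line. The function name is hypothetical and the code is not taken from SNK.

```python
def parse_snk_input(text):
    """Return a list of (items, target) pairs from SNK-style input text."""
    database = []
    current_target = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("//"):
            # Annotation line: everything after the double slash is the target.
            current_target = line[2:].strip()
        else:
            if current_target is None:
                raise ValueError("sequence found before any '//' annotation line")
            # One sequence per line; items are separated by whitespace.
            database.append((line.split(), current_target))
    return database


example = """\
// Target1
Retrotrans_gag zf-CCHC RVP_2 RVT_1
rve Chromo Chromo Chromo_shadow
//Target2
rve Chromo
RVT_1 rve Chromo
"""

for items, target in parse_snk_input(example):
    print(target, items)
```

The sketch accepts the double slash with or without a following space.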
4 SNK output
SNK returns:
* (Case of c-SNoKs) all the minimal sequential nuggets of knowledge with respect to the user's parameters, formalized as sequential association rules.
* (Case of s-SNoKs) all the sequential nuggets of knowledge whose size does not exceed the maximal bound, with respect to the user's parameters, formalized as sequential association rules.
SNK output is sorted by IM value, then by support, then by left hand side length and finally by alphanumeric order.
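The sort order can be sketched with a composite key. The direction of each criterion is not stated here, so the code assumes the most relevant rules come first: IM value and support in decreasing order, then left hand side length in increasing order, then alphanumeric order of the left hand side. The `Rule` type and the sample values are hypothetical.

```python
from typing import NamedTuple, Tuple


class Rule(NamedTuple):
    lhs: Tuple[str, ...]   # descriptors of the left hand side, in order
    target: str
    support: float
    im: float              # interestingness measure value


def sort_rules(rules):
    """Sort by IM (assumed descending), support (assumed descending),
    left hand side length (assumed ascending), then alphanumeric order."""
    return sorted(rules, key=lambda r: (-r.im, -r.support, len(r.lhs), r.lhs))


rules = [
    Rule(("rve", "Chromo"), "Target2", 0.5, 0.67),
    Rule(("rve",), "Target2", 0.5, 0.67),
    Rule(("Chromo",), "Target2", 0.5, 0.67),
    Rule(("RVT_1",), "Target2", 0.25, 1.0),
]

for rule in sort_rules(rules):
    print(", ".join(rule.lhs), "->", rule.target, (rule.support, rule.im))
```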
5 Advanced users, Parameters
SNK allows the user to configure different parameters: support and interestingness measure (IM) thresholds, interestingness measure, and rule minimum size/maximal bound (5, 6, 7 and 8 on figure 1).
- Interestingness measure (IM): This measure defines the quality and strength of the association between a sequence and the target. Confidence is one of the most widely used interestingness measures. Several interestingness measures are considered since not all measures are equally good at capturing dependencies between facts, and no measure is best in all cases. The SNK applet offers a choice of ten standard measures. To select the measure that best suits the data, the user can examine the key properties given in [Tan et al., 2002]. By default, confidence is chosen.
- Support threshold: This threshold is defined for all the rules. It specifies the minimum proportion (between 0 and 1) of the proteins of the database that must share the description given by a rule, that is, the domains occurring in the order specified in the left hand side of the rule together with the target item of the right hand side of the rule. Support can be low, but a minimal support is required in order to avoid associations that involve too few proteins and result from noise.
- IM threshold: This threshold defines the minimal quality required for a rule. The higher the threshold, the stronger the rules that are retained. For "Confidence", a value of 0.8 is usually considered high.
- Rule minimum size (case of c-SNoKs): The minimum size of the generated rules specifies the minimal number of domains expected in the left hand side of the rule. All the c-SNoKs of size greater than or equal to this size threshold will be generated.
- Rule maximal bound (case of s-SNoKs): The maximal bound specifies the maximal number of domains expected in the left hand side of the rule. All the s-SNoKs from length 1 to the maximal bound will be generated.
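To make the support and IM thresholds concrete, the sketch below evaluates a single rule from four counts. Confidence follows its usual definition. Lift is included only as a second, commonly used measure, and it is an assumption that a comparable measure is among the ten offered by the applet. All function and parameter names are hypothetical.

```python
def confidence(n_both, n_lhs):
    """Proportion of proteins matching the left hand side that have the target."""
    return n_both / n_lhs


def lift(n_both, n_lhs, n_target, n_total):
    """Confidence divided by the overall proportion of the target."""
    return confidence(n_both, n_lhs) / (n_target / n_total)


def evaluate_rule(n_both, n_lhs, n_target, n_total,
                  min_support, min_im, measure="confidence"):
    """Return (support, im, accepted) for one rule.

    n_both:   proteins matching the left hand side and having the target
    n_lhs:    proteins matching the left hand side
    n_target: proteins having the target
    n_total:  proteins in the database
    """
    support = n_both / n_total  # proportion between 0 and 1
    if measure == "confidence":
        im = confidence(n_both, n_lhs)
    elif measure == "lift":
        im = lift(n_both, n_lhs, n_target, n_total)
    else:
        raise ValueError(f"unknown measure: {measure}")
    accepted = support >= min_support and im >= min_im
    return support, im, accepted


# 4 proteins, 2 with the target; 3 match the left hand side, 2 of them with the target.
print(evaluate_rule(2, 3, 2, 4, min_support=0.25, min_im=0.8))
print(evaluate_rule(2, 3, 2, 4, min_support=0.25, min_im=1.2, measure="lift"))
```

With a confidence threshold of 0.8 the rule is rejected, since only two of the three matching proteins have the target. Thresholds are not comparable across measures, so the IM threshold has to be chosen with the selected measure in mind.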
6 Other Functionalities
6.1 Basic Functionalities
SNK provides three functionalities that are not directly linked to the SNK algorithm.
- "Back": the "back" function returns to the previous screen (be careful: you can only go back one step). To use this function, click on the "back" button (10).
- "Replace": the key combination Ctrl-F in the main text field opens a "replace frame". In the "replace" field, enter a word or a regular expression (see the exact syntax in the appendix). All the occurrences of the expression will be replaced by the value of the "by" field.
- "Pfam query with proteins ID": this button allows the user to get protein domain architectures from Pfam. Simply copy and paste the UniProt IDs of the proteins and click on this button.
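The effect of the "Replace" function can be illustrated with a short sketch. It uses Python's `re` module, whereas the appendix refers to the Java regular expression syntax; the two differ in places (for example in how a replacement string refers to captured groups), but the simple pattern used here behaves the same way in both. The function name is hypothetical, and its parameters mirror the "replace" and "by" fields.

```python
import re


def replace_all(text, replace, by):
    """Replace every occurrence of the regular expression `replace` with `by`."""
    return re.sub(replace, by, text)


# Turn a comma-separated architecture into a space-separated sequence.
text = "TSP_1, Ldl_recept_a,MACPF,  TSP_1"
print(replace_all(text, r",\s*", " "))
```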
6.2 Visualizing SNK output
Using or analysing the rules mined by SNK is not always easy.
We propose DeeVee as a solution to visualise and analyse SNK output. DeeVee is a simple protein domain viewer that is connected to SNK.
DeeVee uses Pfam output to create its main window. To open the DeeVee main window, click on the "view proteins" button (11) in the SNK window.
The user can click on the "Export rules" button (12) of SNK to open the DeeVee "Rules" window. In this window, the user can select a rule and display the proteins whose domains occur in the same order as in the left hand side of that rule.
References
- [Bateman et al., 2006]
- Finn,R.D., Mistry,J., Schuster-Böckler,B., Griffiths-Jones,S., Hollich,V., Lassmann,T., Moxon,S., Marshall,M., Khanna,A., Durbin,R., Eddy,S.R., Sonnhammer,E.L.L., Bateman,A. (2006) Pfam: clans, web tools and services, Nucleic Acids Research Database Issue 34:D247-D251.
- [Froidevaux et al., 2007]
- Froidevaux,C., Lisacek,F., Rance,B. (2007) Extracting Sequential Nuggets of Knowledge, Proc. of DEXA'07, LNCS 4653 740-750.
- [Rance et al., 2007]
- Rance,B., Lisacek,F., Froidevaux,C. (2007) An algorithm for Mining Minimal Sequential Nuggets of Knowledge. LRI Technical Report 1476, October 2007.
- [Rance et al., 2008]
- Rance,B., Lisacek,F., Froidevaux,C. (2008) SNK: a new method for mining sequential nuggets of knowledge from protein families, under submission.
- [Tan et al., 2002]
- Tan,P.N., Kumar,V., Srivastava,J. (2002) Selecting the Right Interestingness Measure for Association Patterns, SIGKDD'02
File translated from TeX by TTH, version 3.80, on 8 Apr 2008, 10:46.
Appendix
See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html for the exact regular expression syntax.