Version 2.0
SMARTIV Manual
Species and Genome assembly: SMARTIV supports input sequences extracted from binding experiments performed on the following species and genome assemblies:
Input file format: SMARTIV accepts a list of genomic coordinates in BED format (view example) or a list of sequences in FASTA format (view example). If both formats are available, we recommend providing the BED file.
The BED file should contain the following columns:
  • chromosome name in the 1st column,
  • starting and ending position in the 2nd and 3rd columns,
  • and strand in the 6th column.
The list of sequences should be sorted by binding score in descending order (i.e. sequences at the top must have a higher binding signal/noise ratio than those at the bottom). By default, the user is expected to sort the input, but an additional option, "enable SMARTIV sorting for .bed sequences", is provided. This option is recommended for two commonly used formats: BED 6-column format (view example) and ENCODE narrowPeak format (view example). If automatic sorting is used, the input file must contain binding score values ("score" in the 5th column for BED 6-column format, or "signalValue" and "pValue" for ENCODE narrowPeak format).
  • A file in BED 6-column format will be sorted by the 5th column in descending order,
  • a file in ENCODE narrowPeak format will be sorted by the 8th column (primary) and the 7th column (secondary) in descending order.
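The sorting rules above can be sketched in a few lines of Python. This is only an illustration of the described column order, not SMARTIV's own code; the function name sort_peaks and the example peak names are hypothetical, and the input is assumed to be tab-separated with numeric score columns.

```python
def sort_peaks(lines, fmt):
    """Sort tab-separated peak lines by binding score, highest first.

    fmt is "bed6" (sort on column 5) or "narrowPeak"
    (sort on column 8, ties broken by column 7).
    Hypothetical helper, not part of SMARTIV.
    """
    rows = [ln.rstrip("\n").split("\t") for ln in lines if ln.strip()]
    if fmt == "bed6":
        key = lambda r: float(r[4])                   # 5th column: score
    elif fmt == "narrowPeak":
        key = lambda r: (float(r[7]), float(r[6]))    # 8th, then 7th column
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return sorted(rows, key=key, reverse=True)


# Made-up BED 6-column records for illustration only.
bed6 = [
    "chr1\t100\t150\tpeakA\t12.5\t+",
    "chr2\t200\t260\tpeakB\t40.0\t-",
    "chr3\t300\t340\tpeakC\t7.1\t+",
]
for row in sort_peaks(bed6, "bed6"):
    print("\t".join(row))
```

Sorting with a tuple key keeps the primary and secondary criteria in a single pass, so ties in the 8th column are resolved by the 7th column without a second sort.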
Each record in the FASTA file consists of:
  • a header line that starts with '>',
  • followed by a line containing the sequence.
Unless the header contains the coordinates of the sequence, SMARTIV will use BLAT to find the coordinates of the sequence in the assembly.
The coordinates are extracted only from FASTA headers that contain:
  • Chromosome (e.g. chr12 or chrM)
  • Start and end position of the sequence, separated by dash (e.g. 30096589-30096634)
  • Strand (+ or -)
  • and optionally: the binding score (e.g. 12.56)
These fields are separated by tabs, colons (':'), or spaces (' '). A header in any other form is ignored, and the coordinates are determined with BLAT instead.
NOTE: BLAT may fail to find the coordinates of a sequence, even though the sequence is derived from the assembly. When this happens, the sequence is ignored. As a result, the number of valid sequences may drop below 2000, in which case SMARTIV reports an error.
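As an illustration of the header layout described above, the following Python sketch extracts the fields with a regular expression. It is not SMARTIV's parser: the function name parse_header, the assumption that the fields appear in the listed order, and the assumption that chromosome names start with "chr" are all made for this example.

```python
import re

# Assumed layout: chromosome, start-end, strand, optional score,
# with fields separated by tabs, colons, or spaces.
HEADER_RE = re.compile(
    r"^>?\s*(chr\w+)[\t: ]+(\d+)-(\d+)[\t: ]+([+-])"
    r"(?:[\t: ]+(-?\d+(?:\.\d+)?))?\s*$"
)


def parse_header(header):
    """Return the coordinates found in a FASTA header, or None.

    Hypothetical helper; None means the header does not carry
    coordinates in the assumed layout.
    """
    m = HEADER_RE.match(header.strip())
    if m is None:
        return None
    chrom, start, end, strand, score = m.groups()
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "score": float(score) if score is not None else None,
    }


print(parse_header(">chr12:30096589-30096634:+:12.56"))
print(parse_header(">chrM 100-200 -"))
print(parse_header(">seq_42 unannotated sequence"))
```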
Sample data: Clicking on the 'Load example' button loads an example input list in BED format. The calculation parameters are set to their defaults but can be changed by the user. Clicking on the 'Submit' button submits the job, and the results are presented automatically on the server. The provided sample data is PAR-CLIP binding data obtained for the human PUM2 protein [1]. The dataset was extracted from the doRiNA database [2].
1. M. Hafner, M. Landthaler, L. Burger, M. Khorshid, J. Hausser, P. Berninger, A. Rothballer, M. Ascano, Jr., A.C. Jungkamp, M. Munschauer, A. Ulrich, G.S. Wardle, S. Dewell, M. Zavolan, T. Tuschl, Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP, Cell 141(1) (2010) 129-41.
2. K. Blin, C. Dieterich, R. Wurmus, N. Rajewsky, M. Landthaler, A. Akalin, DoRiNA 2.0--upgrading the doRiNA database of RNA interactions in post-transcriptional regulation, Nucleic Acids Res 43(Database issue) (2015) D160-7.
K-mer length range: SMARTIV uses a k-mer-based algorithm to search for enriched motifs (Note: the length of the k-mers does not define the final motif length).
By default, SMARTIV uses a pre-defined range of 5 to 7 nucleotides. The widest range allowed is 4 to 10 nucleotides. To select a single specific length, enter the same value in both the 'min.' and 'max.' boxes.
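To make the meaning of the length range concrete, here is a minimal Python sketch that validates a requested range against the limits stated above and collects the k-mers of each length. It only illustrates what a k-mer length range is; it does not reproduce SMARTIV's motif search, and the function name count_kmers and the example sequences are hypothetical.

```python
from collections import Counter

MIN_K, MAX_K = 4, 10  # widest range allowed according to the manual


def count_kmers(sequences, k_min=5, k_max=7):
    """For each k in [k_min, k_max], count how many sequences contain each k-mer.

    Hypothetical helper for illustration only.
    """
    if not (MIN_K <= k_min <= k_max <= MAX_K):
        raise ValueError(f"k-mer range must lie within {MIN_K}-{MAX_K}")
    counts = {}
    for k in range(k_min, k_max + 1):
        per_k = Counter()
        for seq in sequences:
            # A set ensures each k-mer is counted once per sequence.
            per_k.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
        counts[k] = per_k
    return counts


# Made-up sequences; setting k_min == k_max selects a single length.
counts = count_kmers(["ACGUACGU", "UGUAAAUA"], k_min=5, k_max=5)
print(counts[5].most_common(3))
```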
Motif type: SMARTIV can extract two types of motifs: a combined sequence and structure motif (in an 8-letter alphabet) and a sequence-based motif (in a 4-letter alphabet). SMARTIV provides an option to display only one of the motif types or both. By default, the combined sequence and structure motif (in the 8-letter alphabet) is extracted.
Folding method: SMARTIV supports several alternative folding methods to choose from. By default, MFE folding, as implemented by the RNAfold tool from the ViennaRNA package, is selected.
Job name: An optional parameter that lets you give your job an informative name. If it is left empty, the job is assigned a unique numeric identifier.
Email address: The "E-mail address" field is optional. Fill it in to receive a link to the results page by e-mail. If you do not receive an e-mail from SMARTIV within a reasonable time, check your spam folder.
SMARTIV reports the best motif for each requested k-mer length.
For each motif (PWM), SMARTIV provides both a graphical presentation generated with the WebLogo software and the matrix itself as a text file. In addition, SMARTIV lists the k-mers that were used to build the PWM (view an example of the result page).
WebLogo graphical representation: The PWM motif is represented as a logo, using an adjusted version of the WebLogo software. The logo can be downloaded in JPG or PDF format.
P-value: The p-value presented above the logo reflects the correspondence between the derived PWM and the original scores of the sequences (derived from the CLIP experiment). It is estimated using the mmHG statistic, which evaluates the association between two ranked lists and assigns an FDR-corrected p-value to each PWM (Steinfeld et al., 2013).
Matrix representation: The PWM (Position Weight Matrix) is available for download as a text file (view example).
Clicking on 'View the list of k-mers composing the motif' displays a table of the significant exact strings of length k (k-mers) that were used to build the PWM, together with the related statistical information. The table is also available for download as a text file.
K-mer: The exact motif string color-coded by the logo color scheme.
P-value: The value presented is the mHG score, corrected for multiple testing, which is a tight bound for the P-value (P-value ≤ corrected mHG score).
N: The total number of input sequences.
B: The total number of sequences containing the motif.
n: The cutoff index at which the mHG statistic divides the ranked input list into target (the top n sequences) and background (the remaining sequences) so that the enrichment of the motif at the top of the list is optimal.
b: The number of sequences containing the motif among the n top sequences.
Enrichment: Measures the extent to which the motif is concentrated at the top of the list compared with the whole list. Defined as: (b/n) / (B/N).
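The quantities N, B, n, b and the enrichment can be illustrated with a short Python sketch. It assumes the standard definition of the mHG score as the minimum hypergeometric tail probability over all cutoffs n, and it returns the raw, uncorrected score; the multiple-testing correction and SMARTIV's actual implementation are not reproduced here. The function names and the example occurrence vector are hypothetical.

```python
from math import comb


def hg_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N, B, n)."""
    total = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, min(n, B) + 1))
    return total / comb(N, n)


def enrichment(N, B, n, b):
    """Enrichment as defined in the manual: (b/n) / (B/N)."""
    return (b / n) / (B / N)


def mhg(occurrence):
    """Scan all cutoffs of a ranked 0/1 occurrence vector (top-ranked first).

    Returns (N, B, n, b, score), where n is the cutoff with the smallest
    hypergeometric tail probability. Uncorrected score, illustration only.
    """
    N, B = len(occurrence), sum(occurrence)
    best_p, best_n, best_b = float("inf"), 0, 0
    b = 0
    for n, flag in enumerate(occurrence, start=1):
        b += flag
        p = hg_tail(N, B, n, b)
        if p < best_p:
            best_p, best_n, best_b = p, n, b
    return N, B, best_n, best_b, best_p


# Made-up ranked list: 1 means the sequence at that rank contains the motif.
ranked = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
N, B, n, b, score = mhg(ranked)
print(N, B, n, b, score, enrichment(N, B, n, b))
```

In this made-up example the best cutoff is n = 5 with b = 4, so the enrichment is (4/5) / (4/10) = 2: the motif is twice as frequent among the top five sequences as in the whole list.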
For more information about the mHG statistics, please refer to: Eden et al. (2007)