Version 2.0
SMARTIV Overview
SMARTIV is a web-accessible computational tool for discovering combined sequence and structure binding motifs for RNA Binding Proteins (RBPs) from in vivo binding data. The algorithm relies on the sequences of the target sites, their ranking by scores and their predicted secondary structures.
The motifs are presented as graphical logos that combine the sequence and structure information in an eight-letter alphabet (A, C, G, U, a, c, g, u; upper case for unpaired and lower case for paired nucleotides), which makes them informative and easy to interpret visually.
SMARTIV methodology
Pre-processing of the data: SMARTIV input is a list of processed sequences obtained from a given CLIP-based experiment or downloaded from a CLIP database. The list should be sorted by sequence scores in descending order (sequences with a higher binding signal/noise ratio at the top of the list). For computational efficiency, SMARTIV selects from the ranked list 1000 sequences from the top of the list and up to 9000 sequences from the bottom of the list. The input must contain at least 2000 sequences of more than 20 nucleotides.
RNA secondary structure prediction: SMARTIV offers four different approaches for RNA secondary structure prediction: 1) Minimum free energy (MFE) structure prediction, 2) Maximum expected accuracy (MEA) structure prediction based on the partition function, 3) Centroid structure based on the partition function, and 4) Most probable shape structure representative. For this purpose we use two widely used RNA folding programs, each of which defines every nucleotide of the sequence as either paired or unpaired: RNAfold, from the Vienna RNA package, and RNAshapes.
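The selection step above can be sketched in Python as follows. This is only an illustration and not the SMARTIV code: the function name select_sequences and its parameters are hypothetical, and it assumes that sequences of 20 nucleotides or fewer are dropped before counting and that the top and bottom selections never overlap.

```python
def select_sequences(ranked, n_top=1000, n_bottom=9000, min_len=21, min_count=2000):
    """Pick sequences from a list sorted by descending score.

    Assumptions (not confirmed by the SMARTIV description): short sequences
    are discarded first, and the bottom set is drawn from what remains after
    the top set has been taken.
    """
    usable = [seq for seq in ranked if len(seq) >= min_len]
    if len(usable) < min_count:
        raise ValueError(
            f"need at least {min_count} sequences longer than {min_len - 1} nt, "
            f"got {len(usable)}"
        )
    top = usable[:n_top]
    rest = usable[n_top:]
    # Take up to n_bottom sequences from the end of the remaining list.
    bottom = rest[-n_bottom:] if n_bottom > 0 else []
    return top + bottom
```

With a ranked list of 12000 usable sequences this returns 10000 of them (ranks 1 to 1000 and ranks 3001 to 12000); with 2500 usable sequences it returns all 2500.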
Translating the sequences to a combined sequence and structure alphabet: SMARTIV considers the original length of the CLIP sequences and translates the sequences to an eight-letter alphabet (A, G, C, U, a, g, c, u), where each position in the sequence holds the information for both the nucleotide identity and its predicted secondary structure (paired/unpaired). The capital letters stand for unpaired nucleotides and the lower-case letters stand for the paired nucleotides.
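As an illustration of this translation, the sketch below combines a sequence with a dot-bracket structure string (the notation in which RNAfold reports structures, with "." for unpaired and "(" or ")" for paired positions). The function name to_eight_letter is hypothetical, and converting T to U is an assumption made here for convenience.

```python
def to_eight_letter(sequence, dot_bracket):
    """Merge nucleotide identity and pairing state into one string.

    Unpaired positions ('.') become upper case, paired positions
    ('(' or ')') become lower case.
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    rna = sequence.upper().replace("T", "U")  # assumption: DNA-style input is allowed
    out = []
    for nucleotide, state in zip(rna, dot_bracket):
        if state == ".":
            out.append(nucleotide)
        elif state in "()":
            out.append(nucleotide.lower())
        else:
            raise ValueError(f"unexpected structure symbol: {state!r}")
    return "".join(out)
```

For example, to_eight_letter("GGGAAACCC", "(((...)))") returns "gggAAAccc": a three-base-pair stem in lower case enclosing an unpaired loop in upper case.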
The following steps are performed separately for the combined sequence and structure list (eight-letter alphabet) and the original sequence list (four-letter alphabet):
Extracting enriched k-mers from the ranked CLIP data: The SMARTIV algorithm is based on the assumption that binding motifs are derived from overrepresented sub-sequences of length k (k-mers) that occur more frequently in the bound sequences (as defined by the experimental assay). To extract enriched k-mers, SMARTIV employs the DRIMUST de-novo motif search algorithm, which is a rank-based approach for detecting imbalanced enriched motifs (Leibovich et al., 2013; Eden et al., 2007). DRIMUST searches for k-mers that are enriched at the top of the input sequence list, where the top of the list is determined dynamically by the mHG statistics, without requiring a predefined split into bound and unbound sequences. For each k-mer, DRIMUST assigns a statistical significance value using an mHG score, corrected for multiple testing, which is a tight upper bound on the p-value (p-value ≤ corrected mHG score). SMARTIV uses k-mers that pass a threshold of 10⁻⁵ for the combined sequence and structure data or 10⁻⁶ for the sequence-only data.
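To make the idea of a dynamically determined top of the list concrete, the sketch below computes an uncorrected minimum hypergeometric (mHG) score for one k-mer, given a 0/1 vector that marks which ranked sequences contain it. It is a naive illustration, not the DRIMUST implementation: the function names are hypothetical, one occurrence per sequence is assumed, and the multiple-testing correction and the efficient search used by DRIMUST are not reproduced.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n items are drawn from N items of which B are hits."""
    total = comb(N, n)
    hits = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, min(n, B) + 1))
    return hits / total


def mhg_score(occurrence):
    """Return (score, cutoff): the smallest hypergeometric tail over all cutoffs.

    occurrence[i] is 1 if the k-mer occurs in the sequence at rank i + 1,
    otherwise 0. The cutoff n that minimises the tail plays the role of the
    'top of the list'.
    """
    N = len(occurrence)
    B = sum(occurrence)
    best_score, best_cutoff = 1.0, 0
    hits_so_far = 0
    for n in range(1, N + 1):
        hits_so_far += occurrence[n - 1]
        p = hypergeom_tail(N, B, n, hits_so_far)
        if p < best_score:
            best_score, best_cutoff = p, n
    return best_score, best_cutoff
```

For the toy vector [1, 1, 1, 0, 0, 0, 0, 0, 0, 0], the minimum is reached at cutoff 3 with a score of 1/120, because all three occurrences fall within the top three sequences.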
Clustering and aligning the k-mers: The clustering of the k-mers is performed for each length k separately using VSEARCH, a greedy centroid-based algorithm with an adjustable k-mer similarity function. Prior to clustering, SMARTIV sorts the enriched k-mers by their p-values, obtained by the DRIMUST algorithm. Briefly, the clustering process starts by selecting the k-mer with the lowest p-value, which is then used as the cluster centroid. Subsequently, k-mers are added to the cluster if their similarity to the centroid is above a certain threshold. The process continues until all k-mers are assigned to some cluster. Further, VSEARCH aligns the k-mers in each cluster, prohibiting internal gaps.
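The greedy centroid-based procedure described above can be sketched as follows. This is a simplified stand-in and does not call VSEARCH: the similarity function here is the fraction of identical positions between two equal-length k-mers, whereas VSEARCH uses its own alignment-based similarity, and the threshold value is an arbitrary assumption. All names are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two k-mers of equal length."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(kmers_with_pvalues, min_identity=0.75):
    """Cluster k-mers greedily, most significant first.

    kmers_with_pvalues: list of (kmer, p_value) pairs for one length k.
    Returns a list of (centroid, members) tuples. A k-mer joins the first
    cluster whose centroid it matches; otherwise it starts a new cluster.
    """
    ordered = sorted(kmers_with_pvalues, key=lambda pair: pair[1])
    clusters = []
    for kmer, _p_value in ordered:
        for centroid, members in clusters:
            if identity(kmer, centroid) >= min_identity:
                members.append(kmer)
                break
        else:
            # No existing centroid is similar enough: this k-mer becomes one.
            clusters.append((kmer, [kmer]))
    return clusters
```

Because the k-mers are visited in order of increasing p-value, every centroid is the most significant member of its cluster, which mirrors the sorting step performed before clustering.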
Building Position Weight Matrices (PWMs): To generate a PWM from a given cluster, SMARTIV weights each k-mer in the aligned cluster by the number of times the k-mer was found at the top of the list, as defined by the DRIMUST algorithm parameter b. For the graphical representation, SMARTIV uses a modified version of the WebLogo algorithm, adjusted to present the PWMs for both the eight-letter and the four-letter alphabets.
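A minimal sketch of this weighting is shown below. It assumes that the aligned k-mers in a cluster all have the same length and start at the same column, and it converts the weighted counts to plain column frequencies without pseudocounts; both are simplifications, and the names are hypothetical.

```python
ALPHABET8 = "ACGUacgu"  # upper case: unpaired, lower case: paired


def build_pwm(weighted_kmers, alphabet=ALPHABET8):
    """Build column frequencies from (kmer, weight) pairs.

    weight stands for the number of times the k-mer was found at the top
    of the ranked list. Returns one dict per column mapping each letter
    to its weighted frequency.
    """
    length = len(weighted_kmers[0][0])
    counts = [{letter: 0.0 for letter in alphabet} for _ in range(length)]
    for kmer, weight in weighted_kmers:
        if len(kmer) != length:
            raise ValueError("all k-mers must have the same length")
        for position, letter in enumerate(kmer):
            counts[position][letter] += weight
    pwm = []
    for column in counts:
        total = sum(column.values())
        pwm.append({letter: value / total for letter, value in column.items()})
    return pwm
```

For example, build_pwm([("GAc", 3), ("GUc", 1)]) gives a second column with A at 0.75 and U at 0.25, since the first k-mer contributes three times as much as the second.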
Assigning occurrence scores and p-values to the PWMs: To select the best motifs for a given RBP, each cluster is assigned an occurrence score, which is defined as the total number of occurrences of the cluster's k-mers at the top of the sorted CLIP sequence list. The occurrence score is used for ranking the clusters derived from a set of k-mers of a given length k in a given CLIP dataset.
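The occurrence score amounts to a sum over the k-mers of a cluster, as in the short sketch below. The data layout (a dictionary of clusters and a dictionary of per-k-mer counts at the top of the list) and the function name are assumptions made for illustration.

```python
def rank_clusters(clusters, top_counts):
    """Rank clusters of one k-mer length by occurrence score.

    clusters: dict mapping a cluster id to its list of k-mers.
    top_counts: dict mapping a k-mer to its number of occurrences
                at the top of the sorted sequence list.
    Returns (cluster_id, occurrence_score) pairs, highest score first.
    """
    scores = {
        cluster_id: sum(top_counts.get(kmer, 0) for kmer in members)
        for cluster_id, members in clusters.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```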
SMARTIV assigns a p-value to each PWM based on the correspondence between ranking the sequences according to their match to the PWM and ranking them by the original sequence scores, derived from the CLIP data. The p-value is calculated using the mmHG statistics, which evaluates the association between two ranked lists (Steinfeld et al., 2013).
Ranking the sequences by their match to the PWM is done by scanning each sequence against the PWM, calculating a log-odds score for each sub-sequence, where the background probabilities are defined as 0.25 and 0.125 for the four-letter and eight-letter alphabet, respectively.
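The scanning step can be sketched as below. The uniform background values (0.25 or 0.125) come from the description above; taking the best-scoring window as the score of the whole sequence, using log base 2, and replacing zero probabilities by a small floor are assumptions of this sketch, and the names are hypothetical.

```python
import math


def best_log_odds(sequence, pwm, background, floor=1e-3):
    """Return the highest log-odds score over all windows of the sequence.

    pwm: list of dicts, one per motif position, mapping letters to
         probabilities.
    background: uniform background probability (0.25 for the four-letter
                alphabet, 0.125 for the eight-letter alphabet).
    floor: assumed minimum probability, used to avoid log(0).
    """
    k = len(pwm)
    best = float("-inf")
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        score = sum(
            math.log2(max(pwm[i].get(letter, 0.0), floor) / background)
            for i, letter in enumerate(window)
        )
        best = max(best, score)
    return best
```

Sorting the sequences by this value in descending order yields the PWM-based ranking that is then compared with the ranking by the original CLIP scores. A sequence shorter than the motif has no window and receives negative infinity.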
Selecting the best PWMs to present: SMARTIV enables the user to search for motifs within a range of k-mer lengths. For each requested k-mer length, the most significant motif is presented together with its assigned p-value (if a significant motif is available).