Machine learning approach overview
Machine learning is a set of methods for finding statistical relationships between a target variable and features. The relationship is sought in the form of a function that depends on parameters, and the parameters are initialized with random values. Learning is the search for parameter values at which the function reproduces the dependence of the target variable on the features as accurately as possible. To guide this search, a metric (a loss function) is specified that measures the mismatch between the true values and the current predictions. The optimal parameter values are then typically sought by gradient descent in parameter space.
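The search described above can be illustrated with a minimal sketch: a straight line fitted by gradient descent on the mean squared error. The function name `fit_line`, the toy data, the learning rate, and the number of steps are assumptions chosen for illustration and are not taken from our model.

```python
import random

def fit_line(xs, ys, lr=0.05, steps=2000, seed=0):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    # Parameters start from random values, as described in the text.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated by y = 2x + 1 (hypothetical example values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
```

On this noise-free toy data the fitted `w` and `b` approach the generating values 2 and 1. Real models have many more parameters, but the update rule has the same shape.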
A prediction task in machine learning usually follows a standard pipeline:
1) study of the problem and its scientific field, including recent research and existing solutions
2) finding data sources for the task, collecting data, and forming a dataset of positive and negative examples
3) exploratory data analysis, a close study of the collected data: plotting distributions of feature values, looking for simple correlations, filling in missing values, and filtering out outliers
4) successively evaluating different model architectures (the kinds of function that depend on parameters) to estimate which one performs best for this particular task, then training the model for enough epochs that the estimated accuracy reaches a plateau
5) deployment of the model so that it can be used without going into implementation details; bioinformaticians often skip this step, so their models cannot be used by biologists working in wet labs
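The stopping criterion in step 4 (training until the estimated accuracy reaches a plateau) could be checked with a small helper like the one below. This is a sketch under assumptions: the name `reached_plateau`, the `patience` window, and the `min_delta` threshold are hypothetical and would need tuning for a real training run.

```python
def reached_plateau(history, patience=5, min_delta=1e-3):
    """Return True if the last `patience` epochs brought no real improvement.

    history: validation accuracy per epoch, oldest first.
    The best score of the last `patience` epochs is compared with the best
    score before them; a gain below `min_delta` counts as a plateau.
    """
    if len(history) <= patience:
        return False  # not enough epochs to judge yet
    best_before = max(history[:-patience])
    best_recent = max(history[-patience:])
    return best_recent - best_before < min_delta

# Hypothetical accuracy curves, for illustration only.
flat_curve = [0.50, 0.60, 0.70, 0.75, 0.78, 0.78, 0.78, 0.78, 0.78, 0.78]
rising_curve = [0.1 * epoch for epoch in range(10)]
flat_done = reached_plateau(flat_curve)
rising_done = reached_plateau(rising_curve)
```

Here `flat_done` is True and `rising_done` is False, because only the first curve has stopped improving over the last five epochs.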
In relation to our task
1) Problem to solve: lncRNA and miRNA interact in the cytoplasm, and the interaction is mediated by the Argonaute (Ago) protein. Besides pairing with each other, lncRNA and miRNA also interact with proteins and mRNA. The efficiency of the interaction depends on the local concentrations in a given cell compartment and on the affinity of the interaction, which in turn depends on sequence complementarity and on 2D and 3D structure.
2) Which sources of data are available: interactions are detected and validated with qPCR methods and with various types of sequencing after enrichment for interacting regions (HITS-CLIP, PAR-CLIP). The results of such experiments are published in scientific articles and then collected in databases that scientists all over the world use to check their RNA of interest. Various indirect data can also be added to the model as features. Ideally these are easy to obtain for any miRNA-lncRNA pair, so that the model can predict the interaction for as many lncRNAs and miRNAs as possible. Such data include expression levels, secondary structure, and information on interactions with various proteins. Expression data can reveal an interaction if the expression levels of the interacting RNAs are correlated. Secondary structure and interactions with proteins can influence the affinity of the interaction.
3) Data usage in our research: we chose the lncRNASNP2 and DIANAv3 databases as the data sources for the target variable of our model (that is, the facts of interaction). As input data we used lncRNA sequences from the Ensembl database and miRNA sequences from the miRBase database; the secondary structure was calculated with the RNAfold program. Since the databases mainly contain positive examples of interaction, the negative examples needed to train the model had to be constructed independently. We did it in the following way: to be sure that a sampled pair of miRNA and lncRNA does not interact, we not only required that this pair is not present in any database of positives, . It should be specially noted here that a good data sample is the cornerstone of an ML model; as the saying goes, garbage in, garbage out. Forming a correct sample for our problem requires competence both in molecular biology and in applied mathematics. Specifically, the lncRNAs and miRNAs in positive and negative pairs must not differ significantly, since otherwise the model will tend to overfit: it will simply memorize the answer for a pair from which lncRNA and miRNA are involved in it, instead of looking for patterns that facilitate interaction.
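One way to build negatives that meet the two requirements stated above (the pair is absent from the positives, and both classes draw on the same molecules) is sketched below. It is an illustration under assumptions, not the sampling procedure actually used: the function name `sample_negatives`, the toy identifiers, and the strategy of keeping the lncRNA of each positive pair and swapping in a miRNA drawn from the positive set are all hypothetical, and any further filtering of negatives is not reproduced.

```python
import random

def sample_negatives(positives, seed=0, max_tries=100):
    """Draw up to one negative pair per positive pair, reusing the same molecules.

    positives: iterable of (lncrna_id, mirna_id) pairs with a known interaction.
    Each negative keeps the lncRNA of a positive pair and swaps in a miRNA drawn
    from the miRNAs of the positive set, so no molecule occurs only in negatives.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    ordered = sorted(positive_set)
    # miRNAs are kept with multiplicity, so frequent miRNAs are drawn more often.
    mirna_draws = [mirna for _, mirna in ordered]
    negatives = set()
    for lncrna, _ in ordered:
        for _ in range(max_tries):
            candidate = (lncrna, rng.choice(mirna_draws))
            # Reject known interactions and duplicates of earlier negatives.
            if candidate not in positive_set and candidate not in negatives:
                negatives.add(candidate)
                break
    return sorted(negatives)

# Hypothetical identifiers, for illustration only.
positives = [("lnc1", "miR-a"), ("lnc1", "miR-b"), ("lnc2", "miR-a"), ("lnc3", "miR-c")]
negatives = sample_negatives(positives)
```

Because every negative reuses a lncRNA and a miRNA that also occur in positive pairs, a model cannot separate the classes simply by recognizing which molecules are present.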
4) To try to understand what actually determines the interaction of lncRNA and miRNA, we compared the alignment rates of miRNA and lncRNA between the groups of positive and negative examples, as well as the correlation coefficients of their expression levels. To our surprise, no significant difference was found (PICTURES_HERE), from which we concluded ... retraining. Expression data was only used to filter negative examples as described above.
5) We chose a combination of a CNN and an RNN as the architecture of the model, since the two complement each other for this task: the CNN can capture local patterns of interaction, and the RNN helps to detect the colocalization of these patterns and their mutual reinforcement. The architecture of our model is shown in the figure PICTURE_HERE
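The division of labor between the two parts can be sketched as a forward pass in plain Python: convolution kernels scan short windows of the one-hot encoded sequence, and a recurrent layer then reads the resulting feature vectors in order. This is not our actual architecture (that is the one in the figure). The layer sizes, the random untrained weights, the toy sequences, and in particular the choice to feed the pair as one concatenated sequence are assumptions made for illustration.

```python
import math
import random

ALPHABET = "ACGU"

def one_hot(seq):
    """Encode an RNA string as a list of 4-element indicator vectors."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]

def conv1d_relu(x, kernels):
    """Slide each kernel (width x 4 weights) along the sequence and apply ReLU."""
    width = len(kernels[0])
    out = []
    for start in range(len(x) - width + 1):
        window = x[start:start + width]
        out.append([
            max(0.0, sum(w * v for k_row, x_row in zip(kernel, window)
                         for w, v in zip(k_row, x_row)))
            for kernel in kernels
        ])
    return out

def rnn_final_state(features, w_in, w_rec):
    """Plain tanh recurrence over the convolutional feature vectors."""
    hidden = [0.0] * len(w_rec)
    for feat in features:
        hidden = [
            math.tanh(sum(a * f for a, f in zip(w_in[j], feat))
                      + sum(r * h for r, h in zip(w_rec[j], hidden)))
            for j in range(len(w_rec))
        ]
    return hidden

def interaction_score(mirna, lncrna, n_kernels=4, width=5, n_hidden=3, seed=0):
    """Forward pass with random, untrained weights; the output lies in (0, 1)."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    kernels = [rand_matrix(width, len(ALPHABET)) for _ in range(n_kernels)]
    w_in, w_rec = rand_matrix(n_hidden, n_kernels), rand_matrix(n_hidden, n_hidden)
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    # Assumption: the pair is fed as one concatenated sequence.
    features = conv1d_relu(one_hot(mirna + lncrna), kernels)  # local patterns
    hidden = rnn_final_state(features, w_in, w_rec)           # their arrangement
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))

# Arbitrary toy sequences, for illustration only.
score = interaction_score("UGAGGUAGUAGGUUGUAUAGUU", "ACGUACGUUAGCCUACCUCAGGAUCC")
```

With untrained weights the score carries no biological meaning; the sketch only shows how data flows from local windows to a single prediction for the pair.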
6) We made the model predictions available from the wiki.
7) Target location prediction:
- U-Net segmentation: Based on the CLIP-seq analysis data, we found the interaction coordinates relative to the lncRNA transcript and obtained the sequences of these interaction sites. The lncRNA sequence was one-hot encoded and cut into fragments, and the sequence of a microRNA that interacts with this lncRNA was appended to each fragment. The architecture consisted of 2 encoder layers (3 convolutional layers and max pooling). During training the model tended to predict only class 0; neither reweighting the classes in the loss calculation nor training for 1000 epochs changed the result.
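The data preparation and the class reweighting mentioned for the U-Net experiment can be sketched as follows. The fragment length, the half-open 0-based site coordinates, the inverse-frequency weighting scheme, and all function names are assumptions made for illustration; they are not the settings used in the experiment, and, as noted above, reweighting did not resolve the class 0 collapse in practice.

```python
import math

def make_fragments(lncrna, sites, mirna, fragment_len=8):
    """Cut a lncRNA into fixed-length fragments with per-base site labels.

    sites: (start, end) pairs, 0-based and half-open, on the transcript.
    The miRNA is appended to every fragment; labels cover the lncRNA part only.
    Trailing bases that do not fill a whole fragment are dropped.
    """
    mask = [0] * len(lncrna)
    for start, end in sites:
        for i in range(start, min(end, len(lncrna))):
            mask[i] = 1
    samples = []
    for offset in range(0, len(lncrna) - fragment_len + 1, fragment_len):
        fragment = lncrna[offset:offset + fragment_len]
        samples.append((fragment + mirna, mask[offset:offset + fragment_len]))
    return samples

def inverse_frequency_weights(samples):
    """Weight each class by total / (2 * count), so rare classes weigh more."""
    labels = [label for _, frag_labels in samples for label in frag_labels]
    total, ones = len(labels), sum(labels)
    zeros = total - ones
    return {0: total / (2 * zeros) if zeros else 0.0,
            1: total / (2 * ones) if ones else 0.0}

def weighted_bce(probs, labels, weights):
    """Binary cross-entropy in which each position is scaled by its class weight."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for p, y in zip(probs, labels):
        loss -= weights[y] * (y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return loss / len(labels)

# Toy transcript with one interaction site at positions 10..13 (hypothetical).
samples = make_fragments("ACGU" * 6, [(10, 14)], "UGAGGUAG")
weights = inverse_frequency_weights(samples)
all_zero_probs = [0.01] * 8  # a model that predicts class 0 everywhere
plain = weighted_bce(all_zero_probs, samples[1][1], {0: 1.0, 1: 1.0})
weighted = weighted_bce(all_zero_probs, samples[1][1], weights)
```

In this toy case the weighted loss for the all-zero prediction is larger than the unweighted one, which is the pressure that class reweighting is meant to apply.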
- DNABERT: an NLP model from the BERT family (fully connected models with