Team:SJTU-Software/Model

Statistical Learning

Statistical learning is a body of machine learning theory, established by Vapnik, that is built on statistical methods and differs from other machine learning approaches such as inductive learning. Support vector machines, derived from this theory, have made outstanding contributions both to the theory of machine learning and to a wide range of application fields.

1 Differential Expression Analysis of Genes

By studying the differential expression of genes, we can find:

  • Cell-specific genes
  • Developmental stage-specific genes
  • Disease state related genes
  • Environment-related genes

The basic approach is to quantify gene expression in a biologically meaningful way and then use statistical analysis to find genes whose differences are statistically significant.

Assumption 1: The expression level of a gene is the same under different treatments (the null hypothesis).

Assumption 2: The samples follow a normal distribution.

There are two ways to find differentially expressed genes:

i) Fold change

The basic formula is:

$$ \mathrm{FC} = \frac{\bar{G}_2}{\bar{G}_1}, \qquad \log_2\mathrm{FC} = \log_2\bar{G}_2 - \log_2\bar{G}_1 $$

where $\bar{G}_1$ and $\bar{G}_2$ stand for the mean expression levels of the two gene groups.

ii) T-test

The t-test, also known as Student's t-test, is mainly used for normally distributed data where the sample size is small (for example, n < 30) and the population standard deviation σ is unknown. The t-test uses t-distribution theory to infer the probability of a difference, so as to judge whether the difference between two means is significant. It stands alongside the F-test and the chi-square test. The t-test was devised by William Sealy Gosset to monitor the quality of stout at the Guinness brewery and was published in Biometrika in 1908 under the pen name "Student".

The single-population t-test statistic is:

$$ t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} $$

where $\bar{X}$ is the sample mean, $\mu_0$ the hypothesized population mean, $s$ the sample standard deviation, and $n$ the sample size; under the null hypothesis, $t$ follows a t-distribution with $n - 1$ degrees of freedom.
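As a minimal illustration (a generic scipy sketch with made-up expression values, not the team's actual pipeline), both quantities above can be computed as follows:

```python
import numpy as np
from scipy import stats

# Hypothetical expression values for one gene in two groups (e.g., control vs. treated).
g1 = np.array([5.1, 4.8, 5.3, 5.0])    # group 1 expression levels
g2 = np.array([9.7, 10.2, 9.9, 10.5])  # group 2 expression levels

# Fold change: ratio of group means, usually reported on a log2 scale.
log2_fc = np.log2(g2.mean() / g1.mean())

# Two-sample t-test (assumes normality; Welch's variant drops the equal-variance assumption).
t_stat, p_value = stats.ttest_ind(g1, g2, equal_var=False)

print(f"log2 fold change = {log2_fc:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```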

2 Support Vector Machine

Researchers need a classification model that discriminates cancer-patient samples from healthy-donor samples.

Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification of data by supervised learning. Its decision boundary is the maximum-margin hyperplane learned from the training samples. SVM uses the hinge loss function to compute the empirical risk and adds a regularization term to the objective to control the structural risk, making it a sparse and robust classifier. Through the kernel method, SVM can also perform non-linear classification, and it is one of the most common kernel learning methods. SVM was proposed in 1964, developed rapidly after the 1990s, and has spawned a series of improved and extended algorithms applied to pattern-recognition problems such as face recognition and text classification.

The distance from each support vector to the hyperplane can be written as:

$$ d = \frac{\lvert w^{T}x + b \rvert}{\lVert w \rVert} $$

To maximize this distance:

$$ \max_{w,\,b} \frac{2}{\lVert w \rVert} $$

Given that $y_i\,(w^{T}x_i + b) \ge 1$ for every training sample, we can obtain the final optimization problem as:

$$ \min_{w,\,b} \frac{1}{2}\lVert w \rVert^{2} \quad \text{s.t.} \quad y_i\,(w^{T}x_i + b) \ge 1, \; i = 1,\dots,m $$

For samples that are not completely linearly separable, we need to add a soft-margin strategy. In that case, the optimization problem becomes:

$$ \min_{w,\,b,\,\xi} \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.} \quad y_i\,(w^{T}x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0 $$

Kernel Method:

One situation that may be encountered is that the sample points are not linearly separable. The solution is to map the linearly non-separable samples into a higher-dimensional space, where the sample points become linearly separable.


We use x to denote the original sample point and Φ(x) to denote the new vector obtained after x is mapped into the new feature space. The separating hyperplane can then be expressed as:

$$ f(x) = w^{T}\Phi(x) + b $$

The mapping operation is carried out with a kernel function. Commonly used kernel functions are listed below:

  • Linear kernel: $\kappa(x_i, x_j) = x_i^{T}x_j$
  • Polynomial kernel: $\kappa(x_i, x_j) = (x_i^{T}x_j)^{d}$
  • Gaussian (RBF) kernel: $\kappa(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right)$
  • Sigmoid kernel: $\kappa(x_i, x_j) = \tanh(\beta x_i^{T}x_j + \theta)$
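As a brief illustration (a generic scikit-learn sketch on synthetic data, not the team's actual classifier), a soft-margin SVM with an RBF kernel can be trained as follows:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for expression profiles: 200 samples, 20 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Soft-margin SVM with a Gaussian (RBF) kernel; C trades margin width against violations.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```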

3 Probe Design

Automatic probe design relies on two methods: dynamic programming and deep learning.

3.1 SHS: Spurious Hybridization Searching

Dynamic programming is a branch of operations research that deals with optimizing multi-stage decision processes. In the early 1950s, while studying the optimization of multi-stage decision processes, the American mathematician R. Bellman and his colleagues proposed the famous principle of optimality and founded dynamic programming. Dynamic programming is applied extremely widely, in engineering, economics, industrial production, military planning, automatic control, and other fields.

Dynamic programming consists of three crucial steps (a minimal sketch follows the list):

  1. Establish state transition equation
    The state transition equation characterizes the transition relationship between each sub-state.
  2. Cache and reuse past results
    Use a certain data structure to store the results that have been calculated.
  3. Recursion
    Starting from small-scale problems, iteratively solve the final large-scale problems step by step.
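As a minimal, generic sketch of these three steps (classic edit distance, not the SHS algorithm itself):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    n, m = len(a), len(b)
    # Cache: dp[i][j] holds the answer for prefixes a[:i] and b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(m + 1):
        dp[0][j] = j  # insert all of b[:j]
    # Iterate from small subproblems up to the full problem.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # State transition equation: best of delete, insert, substitute/match.
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a[i-1]
                           dp[i][j - 1] + 1,        # insert b[j-1]
                           dp[i - 1][j - 1] + cost) # substitute or match
    return dp[n][m]

print(edit_distance("GATTACA", "GCATGCU"))  # -> 4
```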

Dynamic programming:



We already know some of the key factors that affect the outcome of the strand-displacement reaction; the most influential is whether the probe itself forms significant secondary structure. However, there are other factors that are easy to overlook. So we propose a tool called SHS (Spurious Hybridization Searching), which quantitatively evaluates the hybridization of the probe with other nucleic acid species in the system. In addition, SHS takes several empirical rules into account.


Considerations

Avoidance of spurious hybridizations with other substances in the system:

With a large number of substances (different miRNAs, different probes, and possible impurities) coexisting in the system, probes (toehold sites) may undergo unwanted reactions (spurious hybridization). We implement an algorithm similar to the Needleman-Wunsch algorithm to give the best alignment between two strands. Our algorithm takes structural information such as loops, bulges, and helices into consideration (different structures have different impacts on the total score) and modifies the score matrix to meet the requirements of probe design. For example, if the number of consecutive matches exceeds a threshold, the score grows exponentially, as in the simplified sketch below.
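A simplified sketch of such a modified alignment score (illustrative only; the base scores, threshold, and run bonus below are placeholders, not SHS's actual parameters, and structural penalties are omitted):

```python
import numpy as np

MATCH, MISMATCH, GAP = 2.0, -1.0, -2.0  # placeholder base scores
RUN_THRESHOLD, RUN_BONUS = 4, 1.5       # placeholder exponential-run parameters

def spurious_hybridization_score(a: str, b: str) -> float:
    """Needleman-Wunsch-style global alignment in which long runs of
    consecutive matches are weighted exponentially, since they signal
    a strong spurious duplex."""
    n, m = len(a), len(b)
    score = np.zeros((n + 1, m + 1))            # best score for each prefix pair
    run = np.zeros((n + 1, m + 1), dtype=int)   # length of the current match run
    score[:, 0] = GAP * np.arange(n + 1)
    score[0, :] = GAP * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                r = run[i - 1, j - 1] + 1
                # Beyond the threshold, each extra match contributes exponentially more.
                gain = MATCH if r <= RUN_THRESHOLD else MATCH * RUN_BONUS ** (r - RUN_THRESHOLD)
                diag = score[i - 1, j - 1] + gain
            else:
                r, diag = 0, score[i - 1, j - 1] + MISMATCH
            best = max(diag, score[i - 1, j] + GAP, score[i, j - 1] + GAP)
            run[i, j] = r if (best == diag and a[i - 1] == b[j - 1]) else 0
            score[i, j] = best
    return float(score[n, m])

print(spurious_hybridization_score("ATGCGTAC", "TACGCATG"))
```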

Minimizing guanine frequency:

Among the four DNA nucleotides, guanine (G) is the most problematic one. First, it can bind to thymine nearly as strongly as adenine does. Second, it can form guanine quartets, which may interfere with the course of the reaction.


Avoidance of long A/T and long G/C regions

Long uninterrupted regions of weak A/T bindings melt at low temperature and breathe significantly, so it is usually better to use probes with a mixed base composition. The hybridization rate constant is also much lower when probes contain long continuous A/T regions. Meanwhile, long uninterrupted regions of strong G/C bindings tend to hybridize to other strands in the system, because the strong G/C binding counteracts the destabilizing influence of DNA bulges and mismatches. In addition, the high base-stacking force may in turn hinder the smooth progress of branch migration.


Score system

We integrate the considerations above into a scoring system that quantitatively evaluates the performance of a given probe. A high score means the probe violates many of the rules described above, so it tends to be a bad probe. You can see how we define the scores of the different rules in the source code of SHS.



Our pseudo-code is demonstrated below:
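In outline (a simplified Python rendering; the weights and helper functions here are illustrative placeholders for the definitions in the SHS source, and `spurious_hybridization_score` is the alignment sketch above):

```python
def shs_score(probe: str, system_strands: list[str]) -> float:
    """Aggregate SHS-style score for one candidate probe (higher = worse).
    Weights below are placeholders, not the actual SHS parameters."""
    score = 0.0
    # Rule 1: spurious hybridization against every other strand in the system.
    for strand in system_strands:
        score += spurious_hybridization_score(probe, strand)
    # Rule 2: penalize guanine content (G-T wobble pairing, G-quartets).
    score += 5.0 * probe.count("G") / len(probe)
    # Rule 3: penalize long uninterrupted A/T or G/C runs.
    score += 10.0 * longest_run(probe, "AT") + 10.0 * longest_run(probe, "GC")
    return score

def longest_run(seq: str, alphabet: str) -> int:
    """Length of the longest stretch consisting only of bases in `alphabet`."""
    best = cur = 0
    for base in seq:
        cur = cur + 1 if base in alphabet else 0
        best = max(best, cur)
    return best

# Pick the candidate with the lowest (best) score:
# best_probe = min(candidate_probes, key=lambda p: shs_score(p, system_strands))
```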

3.2 Deep Learning

Deep learning is a new research direction in the field of machine learning, introduced to bring machine learning closer to its original goal: artificial intelligence. Deep learning learns the internal regularities and representation levels of sample data; the information obtained during learning is of great help in interpreting data such as text, images, and sound. Its ultimate goal is to enable machines to analyze and learn like humans and to recognize data such as text, images, and sound. Deep learning is a complex machine learning approach whose results in speech and image recognition far surpass earlier related techniques.

The most basic idea of deep learning is to optimize a loss function with nonlinear optimization methods. The loss function represents the gap between the model's predictions and the true values, and its value depends on the model parameters. By continuously optimizing the parameters to minimize the loss, an optimal model (a complicated function) is obtained.


Loss Function


Nonlinear Optimization Methods: Gradient Descent
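In its basic form, gradient descent repeats the update $\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta)$. A bare-bones numpy sketch (a generic squared-error example, not the team's training code):

```python
import numpy as np

# Toy data: y = 3x + noise. theta is the single parameter we fit.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

theta, lr = 0.0, 0.1
for step in range(200):
    grad = np.mean(2 * (theta * x - y) * x)  # d/dtheta of the mean squared error
    theta -= lr * grad                       # step against the gradient
print(f"fitted theta = {theta:.3f}")         # converges near 3.0
```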

Model Structure

We have built three generations of deep learning models.

ModelV1:


Structure of model version 1.0:

In designing our model, we have taken the following factors into consideration:

  • Nucleic acid sequences vary in length, and individual models may have poor generalization ability when the length varies widely.
  • Pseudoknot structures may be present in nucleic acids.
  • For the same nucleic acid sequence, different secondary structures may exist in different environments.

Therefore, we have adopted the following strategy for the characteristics of nucleic acid sequences and their secondary structures.

First, we pad the different sequences to the same length, using a simple comma as the padding character. After that, the character-type sequence information is converted into structured form using one-hot encoding for computation, as sketched below.
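A minimal sketch of this preprocessing (the exact five-symbol vocabulary, four bases plus the comma pad, is our assumption about the scheme):

```python
import numpy as np

VOCAB = "AUGC,"  # four bases plus ',' as the padding symbol (assumed vocabulary)

def pad_and_one_hot(seq: str, length: int) -> np.ndarray:
    """Pad `seq` to `length` with commas, then one-hot encode to shape (length, 5)."""
    seq = seq.ljust(length, ",")
    encoding = np.zeros((length, len(VOCAB)), dtype=np.float32)
    for i, base in enumerate(seq):
        encoding[i, VOCAB.index(base)] = 1.0
    return encoding

x = pad_and_one_hot("AUGGC", length=8)
print(x.shape)  # (8, 5)
```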

Considering that different bases within the same sequence may constrain and influence one another, we use the Transformer model to extract attention information.

We use the outer concatenation (outcat) strategy to expand a one-dimensional sequence into a two-dimensional matrix, as sketched below.
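A minimal numpy sketch of outer concatenation (assuming per-base feature vectors of dimension d):

```python
import numpy as np

def outer_concat(features: np.ndarray) -> np.ndarray:
    """Expand per-base features of shape (L, d) into a pairwise map of shape
    (L, L, 2d): entry (i, j) concatenates the features of bases i and j."""
    L, d = features.shape
    rows = np.broadcast_to(features[:, None, :], (L, L, d))  # feature of base i
    cols = np.broadcast_to(features[None, :, :], (L, L, d))  # feature of base j
    return np.concatenate([rows, cols], axis=-1)

pairwise = outer_concat(np.random.rand(8, 16))
print(pairwise.shape)  # (8, 8, 32)
```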

Each channel of the two-dimensional matrix characterizes the relationship between one base and all other bases. We use residual networks and global pooling techniques to extract global information. We refer to such a sequence as 'Channel Attention Sequences'.

Since the channels of the two-dimensional matrix are themselves interrelated, we use a Transformer to extract this information. Finally, we account for direct interactions between adjacent bases; for example, two adjacent bases such as an adenine and a thymine are unlikely to pair with each other. The output of the model is normalized to between 0 and 1 by a sigmoid function and represents the pairing probability of each base.

ModelV2:


Structure of model version 2.0:


After our continuous testing, we realized that ModelV1 has the following problems:

  • If the sequence uses a one-hot coding scheme, the model treats all bases of the same type, at every position in the sequence, as identical. This is not biologically correct; for example, adenine and thymine located at different distances apart obviously have different pairing probabilities.
  • Convolutional neural network-based feature extraction methods are not able to extract global information effectively.
  • The output of the model is only the base pairing probability, which does not directly lead to the final secondary structure.

We have adopted the following strategies for the problems that have arisen above.


First, we add positional encoding to the original one-hot encoding. Positional encoding can effectively characterize the position of each base within the whole sequence:

$$ PE_{(pos,\,2i)} = \sin(pos \cdot \omega_i), \qquad PE_{(pos,\,2i+1)} = \cos(pos \cdot \omega_i) $$

where the frequencies are:

$$ \omega_i = \frac{1}{10000^{\,2i/d}} $$
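A compact numpy sketch of this sinusoidal encoding:

```python
import numpy as np

def positional_encoding(length: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding of shape (length, d_model); d_model even."""
    pos = np.arange(length)[:, None]          # positions 0 .. length-1
    i = np.arange(d_model // 2)[None, :]      # frequency indices
    omega = 1.0 / 10000 ** (2 * i / d_model)  # frequencies
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(pos * omega)         # even dimensions
    pe[:, 1::2] = np.cos(pos * omega)         # odd dimensions
    return pe

print(positional_encoding(8, 16).shape)  # (8, 16)
```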

Trigonometric functions graph:

Graph neural networks are able to efficiently characterize information with graph structure. The secondary structure of nucleic acids is a typical graph structure, where each base can be considered as a point and pairing as an edge. We used a graph convolutional neural network to extract global information.

Graph of secondary structure:

The dot-bracket notation can be used to represent the secondary structure of a nucleic acid. We modify the output of our model to an L × d matrix, where L denotes the length of the sequence and d denotes the number of symbol types.

After testing, our second-generation model gives good prediction results on most data sets. Its predictions on short sequences have surpassed the SOTA model [2].

ModelV3:


Structure of model version 3.0:


However, there are still some flaws in our model.

  • Position encoding based on trigonometric functions has poor interpretability and lacks biological significance.
  • The structures predicted by the model do not always respect the base-pairing rules.
  • The amount of model parameters is too large and there may be a risk of overfitting.

In our third-generation model, we use the word2vec model instead of positional encoding. This model can efficiently extract the position information of a specific base within the whole sequence.
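A minimal gensim sketch of training such embeddings (treating each base as a token is our assumption about the tokenization; k-mers would work similarly):

```python
from gensim.models import Word2Vec

# Each sequence becomes a "sentence" of single-base tokens (assumed tokenization).
sequences = ["AUGGCUACG", "GCGAUUAGC", "UUACGGCAU"]
sentences = [list(seq) for seq in sequences]

# Skip-gram model: each base's embedding reflects the contexts it occurs in.
model = Word2Vec(sentences, vector_size=16, window=3, min_count=1, sg=1)

print(model.wv["A"].shape)  # (16,) -- learned embedding for adenine
```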

Word2vec Model:



To ensure that the predicted secondary structure conforms to the biological rules, we designed a simple algorithm for correcting the results. If a nucleotide does not have a corresponding paired base, we choose to assume that this nucleotide is unpaired.
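A sketch of this correction, under the assumption that the model emits, for each base, the index of its predicted partner (with a base's own index meaning "unpaired"):

```python
def correct_pairing(partner: list[int]) -> list[int]:
    """Keep a predicted pair (i, j) only if it is reciprocal: partner[i] == j
    and partner[j] == i. Any base whose pairing is not reciprocated is
    re-labelled as unpaired (its own index)."""
    corrected = list(partner)
    for i, j in enumerate(partner):
        if j != i and (j < 0 or j >= len(partner) or partner[j] != i):
            corrected[i] = i  # pairing not reciprocated -> treat as unpaired
    return corrected

# Only the reciprocal pair (1, 2) survives; bases 0 and 3 become unpaired.
print(correct_pairing([3, 2, 1, 1]))  # -> [0, 2, 1, 3]
```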

If a model's parameter count is too large, it may overfit the data set. Following Occam's razor ("entities should not be multiplied beyond necessity"), we streamlined our model, reducing its parameter count by a factor of 15.7 while obtaining better prediction results.

Graph Convolutional Networks

In spectrum-based graph neural networks, the graph is assumed to be undirected, and a robust mathematical representation of an undirected graph is the regularized graph Laplacian matrix, that is:

$$ L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} $$

where $A$ is the adjacency matrix, $D$ is the degree matrix, and $I_N$ is the identity matrix.

The regularized graph Laplacian matrix is real, symmetric, and positive semi-definite. Using this property, it can be decomposed into:

$$ L = U \Lambda U^{T} $$

where $U$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.
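For concreteness, a single graph-convolution layer in the common Kipf-Welling form (a generic numpy sketch, not the team's exact architecture) applies $H' = \sigma(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2} H W)$:

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One graph convolution: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)        # ReLU activation

# Tiny graph: 4 bases, pairing edges 0-3 and 1-2; 8 input features -> 4 outputs.
A = np.array([[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]], float)
H = np.random.rand(4, 8)
W = np.random.rand(8, 4)
print(gcn_layer(A, H, W).shape)  # (4, 4)
```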