Team:SJTU-Software/Engineering

  

Team:SJTU-Software/Model - 2021.igem.org

To keep the project on track, we strictly followed the standard engineering cycle. We put considerable effort into making sure the project succeeds and brings value to scientific research.

Research

We began the project with an extensive literature search and consulted doctors and professors in related fields. At the same time, we distributed a questionnaire to investigate public awareness of early cancer screening. From this research, we found an ever-increasing need for rapid, non-invasive, yet accurate methods for cancer diagnosis.




  • The questionnaires show that many people in China are dissatisfied with how cumbersome current early cancer screening is.
  • Prof. Han says that DNA computation is a novel detection method with great potential.

Recent research has shown that the levels of multiple microRNAs (miRNAs) in serum are informative biomarkers for the early diagnosis of cancers. Because DNA can interact with different molecules, transduce signals, and report results in a programmable manner, DNA molecular computation provides powerful tools for analyzing miRNA profiles in clinical serum samples. We therefore aim to build a platform that helps researchers carry out DNA computation more easily.

Design

To build our platform successfully, we listed the problems in related fields that need to be solved:

  • Pre-experiment Data Gathering
    Data gathering is cumbersome because medical data are complex and voluminous. For wet-lab researchers, such tasks take up a great deal of time.
  • Data Analysis
    Data analysis is crucial to wet-lab experiments. For example, to decide which variables matter, researchers need to conduct a differential analysis. However, most researchers are unfamiliar with current data analysis tools and algorithms.
  • Probe Design
    Successful DNA computation requires appropriate probe design, yet probe design is the most tedious and laborious task in DNA computation. The current wet-lab process relies on the experimenter's experience, and even experienced researchers must repeat many experiments to verify that a probe is feasible. Modern computational tools can simplify this process.

Based on these considerations, we designed software to address each problem. This keeps our project from deviating from the original concerns.

Build

Data Gathering

We collect miRNA expression data from public databases such as TCGA and NCBI. For the convenience of users, we built a database with detailed information and integrated it into our software.


Data Analysis tools

To help biological researchers perform data analysis, our software provides the most frequently used data analysis tools.


Differential Expression Analysis of miRNAs

By studying the differential expression of miRNAs, we can find miRNAs with special properties, for example miRNAs related to a disease state.

The basic method is to quantify miRNA expression in a biologically meaningful way and then use statistical analysis to find miRNAs whose expression differs significantly between groups. Differential analysis is a powerful way to screen for suitable miRNA targets, which can serve as biomarkers in early cancer diagnosis.
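The idea can be sketched in a few lines of Python. This is a minimal illustration, not the analysis pipeline shipped in our software: the function names, the thresholds, the pseudocount, and the toy expression values are all assumptions made for the example. It ranks miRNAs by log2 fold change and keeps those that also show a large Welch's t statistic.

```python
import math
from statistics import mean, variance

def log2_fold_change(case, control, pseudocount=1.0):
    """log2 ratio of mean expression (case vs. control); the pseudocount avoids log(0)."""
    return math.log2((mean(case) + pseudocount) / (mean(control) + pseudocount))

def welch_t(case, control):
    """Welch's t statistic for two groups with possibly unequal variances."""
    se = math.sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / se if se > 0 else 0.0

def differential_mirnas(expression, min_abs_lfc=1.0, min_abs_t=3.0):
    """Return (name, log2FC, t) tuples passing both thresholds, largest |log2FC| first.

    `expression` maps a miRNA name to (case_values, control_values).
    The thresholds are illustrative assumptions, not recommended cut-offs.
    """
    hits = []
    for name, (case, control) in expression.items():
        lfc = log2_fold_change(case, control)
        t = welch_t(case, control)
        if abs(lfc) >= min_abs_lfc and abs(t) >= min_abs_t:
            hits.append((name, lfc, t))
    return sorted(hits, key=lambda h: abs(h[1]), reverse=True)

# Toy, made-up expression values (not real patient data).
toy = {
    "miR-A": ([120, 130, 125, 140], [30, 28, 35, 32]),
    "miR-B": ([50, 55, 48, 52], [51, 49, 53, 50]),
}
for name, lfc, t in differential_mirnas(toy):
    print(f"{name}: log2FC={lfc:.2f}, t={t:.1f}")
```

A real analysis would normally also convert the statistic to a p-value and correct for multiple testing across all miRNAs; both steps are left out here to keep the sketch short.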

Support Vector Machine (SVM)

After gathering the data, we need to know quantitatively how the indicators determine the final result. A Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification through supervised learning. Clinical data are often precious and limited, and SVMs classify small data sets well, so they are well suited to these data.
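To make the role of the classifier concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. It is written in plain Python for illustration and is not the implementation used in our software; the hyperparameters (`lam`, `lr`, `epochs`) and the two-feature toy data are assumptions. The learned weights `w` indicate how strongly each miRNA level pushes a sample toward one class.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=500):
    """Fit weights w and bias b by subgradient descent on the L2-regularized hinge loss.

    X: list of feature vectors; y: labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularization term shrinks w slightly on every step.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:  # sample is misclassified or inside the margin
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, made-up data: two scaled miRNA levels per sample; +1 = patient, -1 = healthy.
X = [[2.0, 0.5], [2.5, 0.3], [1.8, 0.6], [0.5, 1.8], [0.4, 2.2], [0.7, 2.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print("weights:", w, "bias:", b)
print("new sample:", predict(w, b, [2.2, 0.4]))
```

In this toy data the first feature is raised in patients and the second in healthy samples, so the first weight comes out positive and the second negative. Weights of this kind are what the DNA computation later has to implement with multiplication, addition, and subtraction operations.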


Probe Design tools

During DNA computation, there are a few key stages for translating the mathematical form of the classifier into a biologically feasible computational strategy. These pictures are from Han Lab.

Typical strategies for the implementation of multiplication operation:




Typical strategies for the implementation of addition operation:




Typical strategies for the implementation of subtraction operation:




Successful DNA computation requires appropriate probe design. The interaction between probe and miRNA is based on the strand displacement reaction, but this reaction does not always proceed smoothly. Our software provides two practical tools to help experimenters design better probes.

SJTUFold: Prediction of secondary structure of nucleic acid molecules

We use the PyTorch framework and combine currently popular graph neural network and natural language processing techniques. Methodologically, this is a new approach that differs from existing secondary structure prediction methods. For short sequences (<100 bp), it obtained better results than the existing secondary structure prediction tools.

SHS: Spurious Hybridization Searching

We developed SHS to evaluate the propensity of a probe to hybridize spuriously with other substances coexisting in the system. The program is written in C/C++, which offers faster computation than many other languages and is therefore suited to such computationally heavy tasks.
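One simple way to think about spurious hybridization is to look for the longest contiguous stretch over which a probe and another strand could form Watson-Crick pairs. The sketch below illustrates only that idea. It is written in Python for readability rather than in C/C++, it is not the SHS algorithm itself, and the function names and the `min_stretch` threshold are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Reverse complement of a DNA or RNA sequence (U is treated as T)."""
    seq = seq.upper().replace("U", "T")
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def longest_complementary_stretch(probe, other):
    """Length of the longest contiguous antiparallel Watson-Crick duplex the strands could form.

    Computed as the longest common substring of reverse_complement(probe) and `other`.
    """
    target = reverse_complement(probe)
    other = other.upper().replace("U", "T")
    best = 0
    prev = [0] * (len(other) + 1)
    for i in range(1, len(target) + 1):
        curr = [0] * (len(other) + 1)
        for j in range(1, len(other) + 1):
            if target[i - 1] == other[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def spurious_partners(probe, background, min_stretch=6):
    """Names of background strands with a complementary stretch of at least min_stretch bases."""
    return [name for name, seq in background.items()
            if longest_complementary_stretch(probe, seq) >= min_stretch]

# Made-up sequences for illustration only.
background = {"strand-1": "GGGTAGGCCTTGGG", "strand-2": "AAAAAAAA"}
print(spurious_partners("AAGGCCTA", background))
```

A realistic tool would score candidate duplexes thermodynamically and allow mismatches and bulges; the contiguous-match rule here is only the simplest possible stand-in.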


Realization of the front and back ends of the software

Back-end

Our back-end framework is the Python-based Django, and our database management system is MySQL 8, whose updated encryption improves database security. Django also facilitates our front-end and back-end development.
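For readers unfamiliar with Django, connecting a project to MySQL is done through the `DATABASES` setting in `settings.py`. The fragment below is a generic sketch of such a configuration, not a copy of our project's settings: the database name, user, host, port, and environment variable name are placeholders.

```python
import os

# Fragment of a Django settings.py; all concrete values below are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",  # Django's built-in MySQL backend
        "NAME": "example_db",                  # hypothetical database name
        "USER": "example_user",                # hypothetical user
        # Read the password from the environment instead of hard-coding it.
        "PASSWORD": os.environ.get("EXAMPLE_DB_PASSWORD", ""),
        "HOST": "127.0.0.1",
        "PORT": "3306",
        "OPTIONS": {"charset": "utf8mb4"},
    }
}
```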



Front-end

We use HTML, CSS, and JavaScript, with the help of the Bootstrap framework, to build the page structure, handle page rendering, and do other front-end work.

Test

Data analysis

We collected a large amount of clinical data from databases such as NCBI and TCGA. These data sets include miRNA expression levels of healthy people and patients.

This is an intuitive visualization of the lung cancer data.




We performed differential expression analysis on the data set and used the differentially expressed miRNAs to construct an SVM classifier. These are important preprocessing steps in realizing DNA computation. The results confirm the correctness and usefulness of our tools.

For more details, see Proof Of Concept.


Secondary Structure Prediction

We used the test set to measure the predictive ability of our model:

F1 score and Accuracy of Model V1

F1 score and Accuracy of Model V2

F1 score and Accuracy of Model V3

The results show that the predictive ability and robustness of our model improved from version to version, which demonstrates the effectiveness of our approach. (The progression from model V1 to V3 was step by step; see the part "The improvement of deep learning model" for details.)
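For reference, the two metrics can be computed as in the sketch below. This is one common way to score secondary structure predictions and is an assumption on our part about the exact definitions: F1 is computed over predicted base pairs, and accuracy over per-position structure symbols. The function names are hypothetical.

```python
def pair_f1(predicted_pairs, true_pairs):
    """F1 score over base pairs, each pair given as an (i, j) index tuple."""
    predicted, true = set(predicted_pairs), set(true_pairs)
    tp = len(predicted & true)  # pairs present in both prediction and reference
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

def positional_accuracy(predicted_structure, true_structure):
    """Fraction of positions whose structure symbol matches the reference."""
    if not true_structure or len(predicted_structure) != len(true_structure):
        raise ValueError("structures must be non-empty and of equal length")
    matches = sum(p == t for p, t in zip(predicted_structure, true_structure))
    return matches / len(true_structure)

# Made-up example: the prediction misses the innermost pair of a hairpin.
true_pairs = [(0, 8), (1, 7), (2, 6)]
predicted_pairs = [(0, 8), (1, 7)]
print(pair_f1(predicted_pairs, true_pairs))
print(positional_accuracy("((.....))", "(((...)))"))
```

Pair-level F1 is usually the stricter of the two, because most positions in a typical structure are trivially easy to label while individual pairs are not.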


Wet-lab validation

Two of our collaborators conducted wet-lab experiments to verify the practicability of our software. Because the purpose of our project is to assist DNA computation, the experiment of Dr. Qian Ma could use all the functions of our software, from data analysis to probe design. The SJTU-BioX-Shanghai team aimed to add two probes to both ends of an aptazyme without changing the original structure of its functional domain, so they mainly used the secondary structure prediction tool in our software.

For more details, see Proof Of Concept.


User's feedback

After using our software, Dr. Qian Ma said that it efficiently predicted the secondary structure of nucleic acids and the hybridization behavior of DNA probes, which helped her team spend less time testing designs through actual lab experiments. However, she pointed out that our software had low sensitivity to biomarkers of other diseases and still needed more trials and iteration.

Using our software, members of SJTU-BioX-Shanghai successfully carried out their TMSD design, which helped move their project forward. As for the DNA computation platform, they suggested that we better integrate the connections between its parts.

Learn

The improvement of data

After reviewing the feedback from Dr. Qian Ma, we concluded that the scalability of our models needed to improve. Accordingly, we collected a large amount of data on other diseases from NCBI and TCGA and integrated it into our models, giving our software stronger functions and broader fields of application.


The improvement of deep learning model

In the first model, we adopted the following strategies based on the characteristics of nucleic acid sequences and their secondary structures:

  • pad the different sequences to the same length
  • one-hot encoding
  • outcat strategy and 2D-CNN
  • Transformer model to extract attentional information
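The first two steps can be sketched as follows. This is a minimal illustration, not our training code: the base alphabet `ACGU`, the padding character `N`, and the function names are assumptions. Padded positions are encoded as all-zero vectors.

```python
BASES = "ACGU"  # assumed RNA alphabet; padding or unknown bases become all-zero vectors

def pad_sequence(seq, length, pad_char="N"):
    """Right-pad a sequence to a fixed length so that a batch has a uniform shape."""
    if len(seq) > length:
        raise ValueError("sequence is longer than the target length")
    return seq + pad_char * (length - len(seq))

def one_hot(seq):
    """Encode each base as a 4-dimensional indicator vector in the order A, C, G, U."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

encoded = one_hot(pad_sequence("GAUA", 6))
for vector in encoded:
    print(vector)
```

In this example the two A's at positions 1 and 3 receive exactly the same vector, so the encoding by itself carries no information about where a base sits in the sequence.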

However, the first model achieved only modest results. It had the following problems:

  • The one-hot encoding scheme makes the model treat all bases of the same type as identical, regardless of their position in the sequence.
  • CNN-based methods cannot extract global information effectively.
  • The output cannot directly give the final result.

In response to the above problems, we have improved our model using the following methods:

  • Position encoding
  • Graph neural networks
  • Dot-bracket notation to represent the structure
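In dot-bracket notation, an unpaired base is written as `.` and a base pair as a matching `(` and `)`. The sketch below converts between the notation and an explicit list of paired positions. It handles only nested structures without pseudoknots, and the function names are hypothetical.

```python
def dot_bracket_to_pairs(structure):
    """Return the sorted list of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)

def pairs_to_dot_bracket(pairs, length):
    """Inverse conversion; assumes i < j for every pair and no pseudoknots."""
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i], symbols[j] = "(", ")"
    return "".join(symbols)

pairs = dot_bracket_to_pairs("((..))")
print(pairs)
print(pairs_to_dot_bracket(pairs, 6))
```

Because the string has one symbol per base, a model can emit the structure directly as a per-position label sequence.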

The performance of our model improved after these changes, but the deep learning model still had flaws. The structures predicted by the model do not always obey the base-pairing rules. In addition, because DNA secondary structure data are scarce, we could only train on RNA data, which is not biologically sound for a DNA application. We therefore came up with two methods to resolve these problems:

  • We designed a simple algorithm that corrects the results so that they follow the pairing rules.
  • Transfer learning. Because DNA and RNA are still similar, we first train our model on a large amount of RNA secondary structure data and then fine-tune it on a relatively small amount of single-stranded DNA secondary structure data.
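A correction step of the first kind could look like the sketch below. It is an illustrative guess at what such a post-processing rule might do, not the algorithm in our software: it drops unmatched brackets, pairs between bases that cannot pair, and hairpin loops shorter than `min_loop` bases. The allowed pair set (Watson-Crick pairs plus the G-U wobble) and the `min_loop` default are assumptions.

```python
# Assumed pairing rule: Watson-Crick pairs (A-U, A-T, G-C) plus the G-U wobble pair.
ALLOWED = {("A", "U"), ("U", "A"), ("A", "T"), ("T", "A"),
           ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def correct_structure(sequence, structure, min_loop=3):
    """Return a valid dot-bracket string, keeping only predicted pairs that obey the rules."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = sequence.upper()
    fixed = ["."] * len(structure)
    stack = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")" and stack:  # a ')' with no open partner is dropped
            i = stack.pop()
            # Keep the pair only if the bases can pair and the enclosed loop is long enough.
            if (seq[i], seq[j]) in ALLOWED and j - i - 1 >= min_loop:
                fixed[i], fixed[j] = "(", ")"
    # Any '(' still on the stack has no partner and stays unpaired.
    return "".join(fixed)

# Made-up examples: a stray bracket at each end, then two pairs between non-pairing bases.
print(correct_structure("GGGAAAUCC", ")((...)))"))
print(correct_structure("GAGAAAACC", "(((...)))"))
```

The output is always a balanced dot-bracket string, so downstream tools such as structure visualizers can consume it without further checks.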


In the end, our model achieved good prediction performance. It was by learning and improving step by step that the model reached its final performance.


The improvement of user interactivity

We held thorough discussions about how the engineering parts connect and built more user-friendly interfaces.

  1. For effective microRNA markers, we constructed a database in which researchers can conveniently find the reagents they need.
  2. For the final output of our software, we provide visualization tools for the predicted structure of probes.
  3. In case researchers find parts of the process hard to understand, we offer detailed guidelines in the protocol.
  4. For people who want to get familiar with our software, we provide examples that make the process clearer.

Maintenance

In the future, we will continue to upgrade our software. On the one hand, our project has good scalability: in theory, any disease with significant differences in miRNA expression can be detected by DNA computation. We can collect more clinical data to help researchers study various diseases. On the other hand, probe design is an important part of DNA nanotechnology. Probes can be used to inhibit enzyme activity, drive cascade reactions, display fluorescence, and so on. We will refine the probe-design tools in our software so that they can help researchers in related fields.