Team:BIT/Model

Model
Introduction
Studies have shown that miRNAs, as biomarkers for colorectal cancer early screening, have higher diagnostic efficacy than detection of DNA expression level, DNA methylation content and other methods. Therefore, the project is to develop a biosensor diagnosis method based on miRNAs related to CRC as biomarkers. Since there are many miRNA targets for colorectal cancer diagnosis, we need to select stable and reliable miRNA as the criteria for colorectal cancer diagnosis. At the same time, the complexity of the tumor makes it very unlikely to rely on a single molecular marker for early screening, so it is necessary to determine the type and content of miRNAs to be detected in the early stage of the project. After obtaining the expression level of multi-target miRNA through detection, we need to conduct comprehensive evaluation of detection results, and then carry out risk classification according to the comprehensive evaluation results.
For the above questions raised in the project, the modeling part was carried out through bioinformatics related means. We used differential analysis, univariate COX survival analysis, multivariate COX survival analysis and other methods to screen miRNA-related biomarkers, found miRNAs highly correlated with CRC, and used them as the target detection objects of biological part. In view of the poor detection effect of single miRNAs, combined content detection was performed on the screened miRNAs, and a risk scoring model was established to convert the detected miRNAs content into a specific score value. Classification results were obtained according to the threshold, and CRC risk of the population was determined according to the classification results. At the same time, based on biological principles, the model part searched for gene pathways related to CRC, which not only verified the model results, but also deepened the understanding of gene regulation patterns.
Materials and Methods
In order to obtain valid and reliable data, we downloaded CRC gene expression profile data and relevant clinical information of samples based on the Human Cancer and Tumor Gene Atlas (TCGA) database. The "Pandas" library of Python was used for data preprocessing, and the "edgeR" software package of R language was used for differential expression spectrum screening. Cox regression analysis and survival analysis were performed on the samples with SPSS software to screen out miRNAs that had a certain correlation with the overall survival rate of CRC patients. Multivariate Cox analysis was performed on related miRNAs to construct a risk scoring model. The "DAVID" software and "PANTHER" software were used for GO and KEGG analysis of selected miRNAs.
Data acquisition and data preprocessing
In order to select colorectal cancer related miRNA, we first need to obtain miRNA expression data and survival data of patients. TCGA database contains clinical data, genome expression data, mRNA expression data, methylation data and so on. It is an important data source for cancer researchers.
Data obtained from the TCGA website included human miRNAs data from 339 samples (329 CRC and 10 normal paracancer tissues) with complete clinical information. Ftprush was used to download the miRNAs name class table, and the existing gene ids were converted into internationally widely used names. A total of 971 valid miRNAs were obtained. After statistics, the original data had many NA values, which may be the reason for the low expression of some miRNAs in human body. We counted the number of miRNA with 0 expression level in different patient samples, and deleted miRNAs with NA value whose expression level was more than 30% of patient samples. Meanwhile, 25 patients in the obtained data had more than two duplicate samples, so they were deleted.
Establishment of screening model of associated miRNAs
1.Screening for differentially expressed miRNAs
In terms of screening for associated miRNAs of CRC, we first used differential analysis to conduct preliminary screening. During the occurrence and development of CRC, the expression level of some miRNAs will change to some extent: the expression level of previously silenced miRNAs may be over-expressed, while the expression level of previously normally expressed miRNAs may be down-regulated. In patients with CRC, some miRNAs are obviously different from the normal miRNAs due to up-regulation. Therefore, biomarkers to be detected can be selected from these up-regulated miRNAs.
Differential analysis is a method that can detect biomarkers which are expressed differently between patients and healthy groups. The "edgeR" software package of R language was used to screen the differential expression spectrum. The group was divided into two groups: the "Normal" group with 10 sample data and the "Tumor" group with 329 sample data. Selection criteria for | log2fold change | > 1, p-value < 0.05. The differential expression volcano plot was plotted using the Gplots package in R language (Fig. 1). The results were analyzed automatically, and a total of 233 miRNAs were obtained, including 100 up-regulated miRNAs and 133 down-regulated miRNAs.
2.Screening highly correlated miRNAs
In actual detection, the detection effect of up-regulation miRNAs is better. We know that diagnosis accuracy of single miRNA is low, and it is obviously impossible to detect all 100 up-regulation miRNAs. Therefore, we continued to screen several miRNAs with the highest correlation with CRC by various methods.
COX regression analysis is through the mathematical statistics regression method, testing the variables in the experiment and problems lead to the results or not, and quantifying the effect of the results. Between univariate COX and multivariate COX, the main difference is that the number of variables: in univariate COX,only one factor is researched; in multivariate COX,multiple factor is considered.
We first performed univariate COX analysis on the data to screen out 7 differential miRNAs that have a certain correlation with the overall survival rate of CRC patient:hsa-miR203b_3p_1,hsa-miR144_5p_1,hsa-miR126_5p_1,hsa-miR15b_3p_1,hsa-miR130b_3p_1,hsa-miR3677_3p_1,hsa-miR193a_3p_1;
Multivariate COX analysis was performed on them, and 4 of them were obtained with statistical significance (P<0.05):miR144_5p_1,miR193a_3p_1,miR126_5p_1,miR15b_3p_1. These 4 miRNAs are all risk factors for the prognosis of CRC, and the higher the expression level is in tissues, the worse the prognosis of patients are.
In order to verify the correlation of miRNAs alone on the survival of CRC patients, kaplan-Meier survival analysis and ROC were performed for these four miRNAs, and the results were shown in Fig. 2 and Fig. 3.
K-M curve firstly calculates the probability that patients who survive a certain period survive the next period (i.e. survival probability), and then multiplies the survival probability one by one, namely the survival rate of the corresponding period. ROC curve is the receiver operating characteristic curve with TPR on the vertical axis and FPR on the horizontal axis, which describes the performance of a model, in this case, the predictive performance of the risk scoring model.
K-M curve results showed that the survival rate of the disease group containing one of the above four miRNAs had a certain degree of differentiation from that of the normal group, and had certain predictive performance for CRC, but the effect could not meet our requirement.
The analysis results of CRC for these four miRNAs were 0.615, 0.589, 0.566 and 0.620 (as shown in Fig. 3), indicating that the detection efficiency of the four miRNAs alone was low. Therefore, we considered the combined detection of these four miRNAs.
Establishment of comprehensive risk scoring model
After obtaining four miRNAs, we need to establish a comprehensive risk scoring model to convert the detected miRNAs content into scores, obtain the grading results according to the divided threshold, and judge the risk of CRC according to the grades.
Multivariate COX regression analysis was performed for miR144_5p_1, miR193a_3p_1, miR126_5p_1, and miR15b_3p_1. Based on the expression levels of four miRNAs and their corresponding partial regression coefficients, prognostic Risk Score model related to CRC was established. The calculation formula is as follows:
               Risk Score= -0.328×miR144_5p_1 + 0.496×miR193a_3p_1 + 0.578×miR126_5p_1 - 0.606×miR15b_3p_1(1)
Calculate the score of each sample according to formula (1), take the average value of each aggregated data as the sample point, and divide the score threshold: According to this formula, score ≤ 0.34 belongs to the low risk group, 0.34 < score ≤ 0.36 belongs to the low to medium risk group, 0.36 < score ≤ 0.44 belongs to the medium risk group, 0.44 < score ≤ 0.46 belongs to the medium to high risk group, and score > 0.46 was considered as high risk group.
We used the newly generated risk score as a new feature for K-M survival analysis, and the K-M survival curve showed that the survival time of the group with higher risk score was much less than that of the group with lower Risk Score. P-value is less than 0.0001, indicating that risk score is significantly correlated with survival time and the model construction is statistically significant.
The area under the ROC curve AUC=0.713, indicating that the multi-target diagnosis method and comprehensive risk score model we constructed for the combined detection of four miRNAs have higher prediction accuracy compared with single miRNA.
In hardware part, a SlipChip oriented multi-channel fluorescence detection module with high detection accuracy, high repeatability and high linearity can be constructed, and the fluorescence intensity can be converted into miRNAs content, and the miRNAs content can be converted into scoring values and grades easily understood through this scoring model. The combined detection of four miRNAs not only ensures the reliability of detection results, but also reduces the tedious operation of detecting too many kinds of markers, making the project operated easily and widely applicable.
Functional prediction and pathway analysis
In order to verify the correctness of the selected markers from the biological principle and deepen the understanding of gene pathway, we used biological software for functional prediction and pathway analysis of four miRNAs.
MiRDB was used to predict the target genes of four miRNAs, and software "DAVID" and "PANTHER" were used for GO and KEGG analysis of target genes. Multiple significant enrichment was found in the biological process (Fig. 6). The accumulation of cellular process, biological regulation and metabolic process is obvious, which mainly focus on metabolic pathway, growth and proliferation, and signal transduction. In pathway analysis, Wnt signaling Pathway, G1/S cell cycle, positive MAPK cascade regulation and TGF-β signaling pathway were significantly enriched. In KEGG analysis by DAVID software, ErbB signaling pathway, Ras signaling pathway and cancer pathway were significantly enriched, especially CRC pathway. (fig. 7)
Summary
The overall project quantifies the risk of CRC in samples by detecting the content of miRNA biomarkers. Model module successfully excavates 4 target miRNA molecules based on bioinformatics methods, establishes risk assessment model and determines the classification to assist the development of biological module. After obtaining the results, we verify and analyze them depending on the biological principles to deepen the understanding of related biological pathways.
At the same time, model module forms a connection with hardware module: converting the fluorescence value measured by the hardware device into the miRNAs’ content which is brought into the model to calculate the scores and get the classification. Meanwhile, hardware part also verifies the feasibility of the model method.