Team:NJTech China/Model

Fermentation model

Background

As designed in the experiment, the yeast with the paas gene produces 2-phenylethanol（2-PE）while growing under the optimal pH value of 4.5~5.0, sufficient water, oxygen, and growth temperature of 30℃. The yeast will continuously produce 2-PE during the growth process, and finally the 2-PE concentration in the shake flask will reach the saturation value.

Therefore, we will apply the growth model to the fermentation tank of industrial production. The goal of the model is to simulate the growth of yeast in the fermentation tank environment, the consumption of substrates and the change of product 2-PE concentration, aiming to explore the optimal mechanism of regulation yeast fermentation, such as controlling the appropriate feeding time, selecting the best time to harvest the product, and so on. For simplicity, we have made the following main assumptions.

Assumptions

1. We ignore the changes of other environmental factors after the start of fermentation

When the fermentation starts, in addition to the important restrictive factors such as the concentration of the substrate, We assume that changes in oxygen concentration, pH and other factors will not significantly affect yeast fermentation.

2. After the start of fermentation, the composition of the fermentation broth is uniform

Due to the shake flask fermentation method, the entire fermentation system is in a uniform state after the start of the fermentation, and there will be no deviation due to partial non-uniformity.

Parameters definition

Mathematical mode

(1)Yeast growth model

It is known from the literature that among the models proposed by many scholars to describe the growth of bacterial, the Monod equation and the Logistic equation are the most widely used.

However, the Monod equation is mainly used to describe the growth of colonies under the condition of slow cell growth, low density, and single substrate restriction. This condition is too ideal. Judging from the experimental conditions and the fermentation curve of yeast fermentation process, the influence of yeast cell biomass and product 2-PE concentration on cell growth cannot be ignored. Therefore, the Monod equation is no longer applicable. In this experiment, we adopt the Logistic equation to simulate the growth of yeast.

The Logistics equation is a typical S-shaped curve, which is widely used in describing the growth of microorganisms and performs greatly. It can fit the lag phase, logarithmic growth phase and stable phase of the yeast growth curve well . The specific expression of the equation is:

When t=0, the initial biomass of yeast at the beginning of the fermentation C _x=C ₀ is brought into the analytical solution of the equation:

Take the logarithm of both sides of the equation to get the equation:

Parameter values μ_max、C_max obtained by drawing.

(2) Kinetic model of product growth

According to the characteristics of the fermentation product, the relationship between the product and the growth of the bacteria is mainly divided into the following three categories:

①The formation of product is not coupled with the growth of the bacteria: the product and the bacteria exist at the same time;

②The product formation is coupled with the growth of the bacteria: the product is only produced when the bacteria grow;

③Product formation is partially coupled with the growth of the bacteria: part of the products are formed during the growth of the bacteria, and most of the products are formed during the stable growth of the bacteria.

For the product growth model, we use the Leudeking-Piret equation to describe the production law of product 2-PE：

(1) When α≠0, β=0, growth coupled type

(2) When α≠0, β≠0, partial growth coupled type

(3) When α=0, β≠0, non-growth coupled type

When in a steady state, the yeast biomass hardly changes with time, that is, C_x=C_max, dC_x/dt=0 bring in：

At the same time, when the fermentation just started, the concentration of the product 2-PE is approximately equal to 0, that is, when t=0, P=0 is brought in, and we get:

By plotting B(t) versus A(t), the slope of the straight line obtained is α. We determine the values of α and β by fitting experimental data, and determine the production model of 2-PE.

(3)Kinetic model of substrate consumption

In the fermentation and synthesis process of 2-PE, the consumption of carbon source mainly includes:

①Constitute bacterial components for cell growth;

②Maintain the growth of yeast and its metabolic needs;

③Synthesis of primary and secondary metabolites such as 2-PE.

This model adopts the Leudeking-Piret-like equation to simulate the consumption process of the substrate. The mathematical relationship is:

Then we can get :

Where:

Plot A versus M(t), and the slope of the line obtained is γ.

Results and Analysis

Shake flask fermentation experiment data

Figure 1 shows the curve of yeast concentration, phenethyl alcohol production, and substrate concentration over time during yeast fermentation.

Figure 1

It can be seen from the figure 1 that after the inoculation of the yeast, due to the rich nutrients in the medium in the early fermentation stage, the yeast grows rapidly and metabolizes vigorously, and the growth rate is fast, and it can quickly enter the logarithmic growth phase. In the late stage of yeast growth, due to the exhaustion of nutrients in the medium and the accumulation of harmful metabolites, the living environment of yeast continues to deteriorate, resulting in the death of yeast and the decline of yeast biomass. It can be seen that the growth of yeast basically conforms to the S-shaped growth curve, the biomass can reach the maximum in about 6 days, and the increase in biomass inhibits the growth of yeast, which conforms to the applicable range of the Logistics equation, so It is appropriate to use it to describe the growth law of yeast in the fermentation process.

At the same time, in the early stage of yeast fermentation,2-PE is less synthesized; in the logarithmic growth stage and saturation period of yeast, the synthesis of 2-PE increases steadily, indicating that the relationship between synthesis of 2-PE and the growth of the yeast belongs to non-growth coupled type.

Yeast growth model

According to the kinetic model equation and plotting versus t, the relationship between the two can be obtained as:

Correlation coefficient R²=0.9611, which can be calculated μ_max= 0.1449

Bring the parameters into the equation to get the kinetic model of colony growth as:

Perform nonlinear fitting on the parameters of the data, as shown in Figure 3. It can be seen from the figure 3 that in the early and middle stage of yeast growth, the predicted value and the experimental value have a high degree of fit, but the deviation is high in the late stage of yeast growth. This is because the substrate is consumed and is not enough to maintain the basic life activities of the yeast, resulting in a significant decrease in the biomass in the later stage.

Figure 3

Product generation model

Through data fitting, we get the relationship between [P-βB(t)] A(t) as:

Where R²=0.9899

Bring the parameters into the equation-the kinetic model of colony growth is:

As shown in Figure 5, the shake flask fermentation curve of 2-PE shows that at the initial stage of yeast growth, yeast growth is proportional to substrate consumption, and the production of 2-PE is less; in the logarithmic phase of yeast growth, the Synthesis is directly proportional to substrate consumption. This indicates that the synthesis of 2-PE and the growth of yeast belong to a partial growth-coupled relationship.

Figure 5

Substrate consumption model

By plotting [S₀-S-δN(t)] versus M(t), we get the relationship between [S₀-S-δN(t)] and M(t) as：

Where R²=0.9511

Bring the parameters into the equation-the kinetic model of colony growth is:

Figure 7 shows the fitting curve of the substrate consumption during the shake flask fermentation process. We found that the Leudeking-Piret-like model has a great fitting effect. The main reasons why the substrate is slow in the early stage are: the yeast has not entered the logarithmic growth phase after inoculation, and at the same time, the production of 2-PE is small; when it enters the late fermentation period, the substrate consumption tends to be stable.

Figure 7

Conclusion

Experiments have proved that the yeast introduced with the paas gene has strong biological activity and can be used as a dominant strain for the production of 2-PE.

Based on the understanding of the metabolic characteristics of yeast production of 2-PE, yeast growth and substrate consumption, etc., a fermentation kinetic model for the synthesis of 2-PE in shake flasks was constructed. The predicted value and actual results can be well matched. This provides a further basis for the fermentation production control of 2-PE.

Reference

1. Zhou, Q.; Liu, Y.; Yuan, W., Kinetic modeling of butyric acid effects on butanol fermentation by Clostridium saccharoperbutylacetonicum. N Biotechnol 2020, 55, 118-126.

2. Wang, T.; Lu, Y.; Yan, H.; Li, X.; Wang, X.; Shan, Y.; Yi, Y.; Liu, B.; Zhou, Y.; Lu, X., Fermentation optimization and kinetic model for high cell density culture of a probiotic microorganism: Lactobacillus rhamnosus LS-8. Bioprocess Biosyst Eng 2020, 43 (3), 515-528.

Expansion model

Based on the second product produced by our project, it can be sprayed directly on the land to produce 2-PE, in order to produce a fragrant smell. We tested it in the experimental field and established a second model- yeast expansion model.

The proliferation of microorganisms in the environment has always been an interesting topic. However, due to the complexity of the system, it is very difficult to study precise motion. Therefore, we urgently need to establish a model to simulate the growth and spread of yeast on the ground.

The model is divided into two parts, the growth model and the diffusion model, and the increase in biomass at t+1 in each small area is determined by two parts: the diffusion volume of the surrounding area and the self-growth of yeast in the area at time t. and:

Where：

u represents the biomass of yeast in a certain unit area,u(G.N) and u(E.N) represent the growth and diffusion of yeast in time Δt.

Model assumptions

In order to ensure the rationality of the model, we have made the following assumptions:

Yeast growth stage

The main forces that promote the movement of microorganisms are:

1. The concentration difference of the substrate: Because of the different nutrient concentration in different areas, the yeast will spontaneously move from the area with lower nutrient concentration to the area with higher nutrient concentration. At the same time, we believe that the concentration of nutrients in each area of the soil is initially equal in this model.

2. Cells "sliding" or "pushing": When cells grow rapidly, they tend to occupy the entire space. As the concentration reaches a certain level, they tend to push each other and the colonies begin to expand outward.

The factors we don't consider are:

Changes in yeast concentration due to various factors in the natural environment. For example, the temperature is too high or too low to cause a large number of deaths of yeast, or the loss of a large amount of nutrients due to the action of running water.

Late stage of yeast expansion

We considered two simple situations—nutritionally adequate and nutritionally deficient. The nutrient-sufficient type of yeast will not die because the soil provides enough nutrients, and it will continue to grow until the concentration reaches the maximum; the nutrient-deficient soil provides limited nutrients, so as the growth of yeast, The nutrients are gradually consumed and the metabolites gradually accumulate to inhibit the growth of yeasts, so the death of yeasts occurs in areas with higher concentrations in the late stage.

Parameter definition

Diffusion model:

Calculate the gradient

We use Fick's first law, which is a classic equation describing the diffusion of particles in a solution. Assuming that the diffusion flux J from the high solute concentration area to the low concentration area, the size is proportional to the concentration gradient Δu, we can write:

in two dimensional space, it is

Where D is the diffusion coefficient, which is an inherent property of the solute and the medium. A large value of D indicates a higher diffusion rate. J is a two-dimensional vector that can be decomposed into several directions. Therefore, we can calculate the diffusion rate in any direction.

Derivation of the diffusion equation:

We consider a micro-element ΔS of the system at position (x, y), then the amount of yeast that diffuses into the area element ΔS in time ΔT is:

On the other hand, the amount of yeast flowing out of the area element ΔS in the modified time is:

At the same time, the yeast biomass of the area element changes with the progress of the diffusion process. The specific concentration of this area is defined as the biomass required for the area element per unit area to change by one unit concentration, and the mathematical expression is as follows:

From the law of conservation of matter：

in two dimensional space, it is

Substitute Q₁, Q₂, Q₃ to get:

Where:

Use the finite difference method to solve the diffusion equation:

Suppose the growth area of yeast is the rectangular area of l×S , and the solution

area is discretized into N×M units:

therefore, u(x,y,t) is rewritten as u_i,j,k

At the node, the equation is simplified to:

Finally, the difference format of the equation is:

Where:

Yeast growth model:

It is obvious that yeast does not have structures such as flagella, so it is far from enough to use the diffusion model for the expansion process. Yeast will continue to grow while diffusing to a low concentration due to the concentration difference, resulting in an increase in the biomass of yeast per unit area, so we introduced a yeast growth model.

Like the previous model, we use the Logistic equation to simulate the growth of yeast. The equation is as follows:

After discretization, we can get:

Simulation of results

Finally, we get the difference form of the yeast expansion equation as:

We show the simulation results through MATLAB:

The picture shows the ideal situation: the soil has enough nutrients and the yeast has a high growth rate. In the initial stage, we assume that a certain biomass of yeast is sprayed in a circular area, so the initial colony is a circular area. At first, the yeast tried to expand, but it has not reached the maximum biomass. We can see a clear gradation (from blue to yellow to red) in the initial stage of yeast expansion. Due to fast growth, the grid in the center has reached its maximum carrying capacity, showing a dark red color. At first, the grid on the edge of the circle showed that the biomass of yeasts was small, and it was dark blue. However, with sufficient nutrients, as the yeasts continued to spread and grow, the edge area was eventually covered with yeasts and turned into dark red.

The picture shows the situation of nutritional deficiencies. As in the ideal case, the initial colony is a circular area. In the process of yeast expansion, the biomass of yeast in the central area gradually increases and gradually expands to the surroundings. But when the yeast growth in the central area reaches maximum, due to the consumption of nutrients and the accumulation of metabolites, the yeasts in the central area with a higher biomass gradually begin to die at the beginning . Therefore, we can clearly see that the the color of center area changed from blue to red at the beginning and finally turned to blue again.

reference

1. Papacek, S.; Matonoha, C.; Stumbauer, V.; Stys, D., Modelling and simulation of photosynthetic microorganism growth: random walk vs. finite difference method. Math Comput Simulat 2012, 82 (10), 2022-2032.

2. Bussemaker, H. J.; Deutsch, A.; Geigant, E., Mean-field analysis of a dynamical phase transition in a cellular automaton model for collective motion. Phys Rev Lett 1997, 78 (26), 5018-5021.

Promoter model

Background

Machine learning methods have been widely used in various biological research fields, such as protein structure and stability prediction, RNA secondary structure prediction, promoter recognition and structure analysis. At present, there is no database related to pheromone responsive promoters. The lack of systematic characterization data greatly limits the application of pheromone responsive promoters in gene circuits.

Abstract

Based on the achievement built by IGEM team last year who developed a model to predict the relationship between promoter sequences motif pheromone response element (PRE-box) and promoter activity using MATLAB, this year we selected pheromone inducible promoters, constructs a library, and uses the characterization data in the database to train the Machine learning model, so as to attempt to find out the sequence-based rule of the activity of pheromone inducible promoters. The model includes Random Forest (RF) model, Convolutional Neural Network (CNN) model, Bidirectional Gate Recurrent Unit (Bi-GRU) model. In addition, we make a further improvement of building a hybrid model, Bi LSTM-CNN model, eventually selected the best preformed model among all the model we build using accuracy, loss curve and ROC curve.

1 Technical route

Our prior goal is to find the biological rule of how the PRE-box motif affect the promoter activity, thus we used K-means to stratify our data, and then using three main data pre-processing methods to convert the data into fitful form for four models. We eventually chose Random Forest model, Convolution Neural Network model, Bidirectional Gate Recurrent Unit (Bi-GRU), and the hybrid model, the Bidirectional Long-Short Term Memory (Bi-LSTM)-Convolution Neural Network (CNN). The reason why we chose so many models is they have their own specialties, we think as there were no predecessors who have encounter this field, we should try more possible solutions to enlarge the chance we solve the problem.

Then we used 5-fold cross validation method to train our model. Using Accuracy-Loss curve and ROC curve to evaluate our model's performance. In the future, we will use the new, more ample data to feed our best model in order to get more reliable result.

[Figure 1] The technical route we used.

2 Data pre-processing

As we attempt to find out the specific relationship between the PRE-box and the promoter activity, we selected 213 promoter sequences that are available for expression from about 7000 origin data set that provided by the wet lab. The origin data set contains lots of useless data for our project. For example, some of the sequences encompass PRE-box motif, whereas this part of the promoter sequence doesn't express. That needs to be screened by experiments otherwise it will affect the data's accuracy.

While we get our final data set (213 sequences), the first thing we worried is the lack of data. It is well known that whether machine learning or deep learning, they both need tons of data to feed the model as they need plenty training to predict a convincible result.

5-fold cross validation

we used 5-fold Cross-Validation in order to fully use our data. We first randomly divided our data into 5 sets, then chose one set to be the test set, the rest of them be the train set. While this fold is done, the second set of data will be the test set, the other become the train set. In this case each set has been chosen to be both the test set and the train set separately, thus the single data set is used five times.

[Figure 2] The concept of 5-fold Cross Validation we used.

Due to the insufficiency of data, which set to be the test set is much more crucial, so we first stratified the data and make sure that each set contains all categories we defined. The classifier we want can divide the promoter sequence by PRE-box in to 5 categories, separately meaning: very weak, weak, medium, strong, very strong. In the future when we have sufficient data, we might increase the number of category and fit a regression curve.

K-means

As promoter intensity vary greatly from each other, although every digit has mathematic relationship, the computer will still see them as independent classes. So separating intensity data to classes as part of pre-processing is required. Because the intensity data is not distributed uniformly, we can't use an even classification. To find a statistically proper way to classify intensity data, we use the K means Cluster to classify the uneven data.

[Figure 3] The algorithm flow chart of K-means

Despite K means Cluster is commonly used for 2D data, 1D data like ours is still suitable for it. After tuning the cluster number, we choose to use 9 cluster classifications first and then merge the 5 small classes into a new one. That's because when cluster = 9, the big classes where data is more aggregate are separated fittingly, the small classes contain one or two digits due to relatively big numbers. We merge the small classes with big numbers into one new class, representing “very strong” intensity.

Finally, we divided intensity data to 5 classes, each representing “very weak”, “weak”, “medium”, “strong” or “very strong”.

ROC

The method we chose to evaluate our model is to make Receiver Operating Characteristic curve (ROC curve). It is one of the most commonly used indicators in the machine learning community which has great capacity for evaluating the classification model. The classification model (classifier) is capable for categorization of data and divide them into different sets (classes). For our model, we based on the origin Binary classification ROC curve developed a multiple classification ROC curve for our model.

First, we should understand the basic theory of the Binary classification of ROC curve. While we training a dataset classifier, the predict result may equal to the actual result, or it may not. And there are two actual result which are true and false. It is obvious that there are four possible situations may occur, False-positive (FP), False-negative (FN), True-positive (TP), True-negative (TN). Most commonly used evaluation indictors for the two-category problem are precision and recall. The precision rate is defined as:

The recall rate is defined as:

Moreover, F is the weighed harmonic average (WHA) of the precision and recall:

While a=1,the F ₁ score is the harmonic mean of the precision rate and the recall rate:

The higher the F\F₁ score, the more accuracy the classifier it is.

In the statistics, there are two methods to calculate the F\F₁ score, which are 'macro' and 'micro'. As our dataset is not balanced and it will affect the final result, we choose to use 'micro' as it is more fitful for our model evaluation.

3 Model

3.1 Random Forest Model

Random Forest model offers outstanding method for working with missing data and thousands of variables, some reference illustrated that the random forest model shows great effect on classifying data.

3.1.1 Characteristics extract

Based on the data we obtained. We used Python script to count the characteristics that we think should be considered to analyze: Number of +consensus PRE, Number of – consensus PRE, Total consensus PRE, Number of +non-consensus PRE, Number of -non-consensus PRE, Total non-consensus PRE, Total PRE, promoter length (BP). After this we collected the new characteristic data and saved in the excel.

[Table 1] The characteristics and information we collected using Python

3.1.2 Model structure

[Figure 4] The structure of Random Forest model.

[Figure 4] is the structure plot we made to explain how our Random Forest model works, more detailed model construction process can be portrayed as:

(1) n new training sets are sampled from the original data by bag sampling method.

(2) For each new training set, the corresponding regression tree model is established and the predicted value is printed out.

(3) After n regression tree models are established, the final predicted value of the model is the average value of all regression tree outputs.

In addition to the characteristics of relatively high precision and fast efficiency, Random Forest model can also reverse evaluate the relative importance of input parameters to the target value according to the node separation results. This characteristic is often used in parameter sensitivity analysis. When a node is separated, it will cause the decline of separation evaluation criteria. For a specific input variable, the total number of trees can be used to normalize the total decline in all regression trees to evaluate the relative importance of parameters and impurity evaluation index, translate to mathematical language is:

I_jf is the importance of node j for feature f in the decision tree;C_j is the impurity evaluation index for node j ; M_l and M_r respectively are sample number of the left and right subset of node j ; M_j is the total sample number of nodes j ; C_l(j) and C_r(j) are respectively the impurity evaluation index of the left and right subset.

The relative importance can be express as:

n is the total number of the regression tree; j is the total number of nodes in the decision tree.

3.1.3 Result and evaluation

[Figure 5] The mean ROC curve of random forest in 5 folds.

The ROC curve shows that expect for class 0, the other classes' results seem to be lack of reliability, as their AUC is less than 0.5. Moreover, the total random forest AUC is only 0.38, which means for most classes. this model is not reliable.

[Table 2] The mean accuracy of random forest model after 5-fold of each characteristic.

The result of Random Forest predicted is [Table 2], it can be seen that the accuracy is relatively low, even the highest feature can merely reach 0.3. So how would this happen?

The most probable reason we concluded is the problems with the selection of features. While we selecting features, we can't tell which feature to choose to have a viable effect on promoter activity, and how many features are enough. Feature number too small causes inaccuracy, feature number too big causes redundancy. The selection of feature is important but hard to deal with, so we tried next model to skip this impediment.

3.2 Convolution Neural Network (CNN) Model

In order to solve the feature selection problem, we used Neural Network models as they are capable for selecting features, not by human assumption. Several questions mentioned above can be avoided by this method.

Among neural network models, Convolution Neural Network (CNN) model has the best performance on characteristic feature selection thus we first apply this model to our project.

3.2.1 Data pre-processing

However, there are a huge problem needed to be solved before we building the CNN model, as the sequence data we collected is not unified (length is inconsistent), whereas the neural network model needs normalized data, so we first find the longest sequence and then filling meaningless 0 to the other sequences until the length equal to the longest. Then using one-hot encoding method to convert the promoter sequence to a 4×N array [Figure 5]

[Figure 6] Using one-hot encoding method to convert bases into digit 0 and 1, N represent the number of bases of the longest sequence, the green area represents the replenished irrelevant digits.

[Figure 7] The deeper the colour of the green area, the more the data is convoluted.

But as the kernel of CNN has a feature that it will slide over the data one kernel set by one, and our kernel is vertical sliding over the data, so the convoluted data set is one-dimensional so only the data after the length of kernel will be fully convoluted, the data on the front side and end side would be convoluted less than the others [Figure 7], which would lead to biased result. As a consequence, we constructed convolution shells at the fringe of the sequence data. The convolution shell is a set of none-relevant digits which used to prevent the situation mentioned above. [Figure 8]

[Figure 8] Valid convolution processed data.

After data processing, our CNN model is constructed by following stages: CNN stage, dense stage and active stage.

3.2.2 Model structure

CNN stage

In the convolution stage, there will be multiple kernel K slide over the embedded array, where the K∈i^n×k, k is the fixed-length set in Embedding stage:

Where x_i is the feature of number j, l represents the number of the layer; K_j is the convolution kernel number j; * represents convolution operation; b^l_j represents the bias of feature number j on the layer l .

The activation function we used is ReLU, the function equation is as follow:

Dense stage (Full Connection stage)

Dense stage can integrate the feature values that last layer provide, then judge whether it is valid. It can turn the fixed length vectors into specific length vectors (n).

We chose n=1, so the vector becomes a real number.

Active stage

At this stage we could reflect the real number last layer produced into a zero to one interval ([0,1]). As we attempting to build a multiple classifier, the activation function we chose were SoftMax, it can be express as follow:

Where σ(z)_i is the value after SoftMax standardization. z_i is the value that needs to be transformed. Basically this function can standardize the input data and scale them to the same order of magnitude which is below one and higher than zero.

In general, the structure of CNN model is [Figure 9].

[Figure 9] The structure of CNN model.

3.2.3 Results and evaluation

[Figure 10] Mean Accuracy-Loss curve of CNN in 5-fold

The Accuracy-Loss curve shows that the training accuracy is really high (near 1), whereas the mean test accuracy of 5 folds is only 0.4883, it means that our model learned well but do not fit the test data. Overfitting tend to be the most probable reason for the biased test accuracy. But after we tuned the parameters that might help, the result remains the same. So for further improve we might need more data from the wet lab.

3.2.3 Results and evaluation

[Figure 11] The mean ROC curve of CNN in 5 folds.

The ROC result of CNN is obviously better than random forest model, but for class 3,4,5 the AUC (area under curve) is zero. This is most likely because of lack of data as the class 0 and class 1 together have almost two-thirds of all data.

While CNN results are better than Random Forest, we think, for our dataset, there should still have room for improvement. After discussion, we think a key factor that CNN model lacking is that it can't find the sequential relationship between PRE-boxes. By transfer sequence to a 4*N array, the convolution kernel is good at translate sequence data to digital data that machine can understand and easier to find the features for each PRE-box. While sequential relationship between PRE-boxes is proven by experiment to have effect on promoter intensity, and this is omitted by CNN. Consequently, we need a new neural network model for sequential relationship of PRE-boxes.

3.3 Recurrent Neural Network: Bidirectional Gate Recurrent Unit

RNN model has better ability to obtain the characteristics than CNN. There are many models based on RNN, but before we chose which model, the prior thing is to convert the data.

3.3.1 Wrod2Vec

In order to find the sequential relationship, we need a method for data processing that translate gene sequence to digital data. This procedure is called word embedding. One of the commonly used method is One-hot encoding. But One-hot encoding has several drawbacks, mainly being that the processed data are independent from each other, with no digital relation, and it produces a lot of redundant data and thus make the calculation slow.

We choose the Word2Vec method. It's one of the most popular ways to process natural language, and it's proven to be successful on a variety of downstream natural language processing tasks. Gene sequences are similar to sentences in an algorithmic way.

Word2Vec turns words to sets of digits, which are called vectors. And use digits in vectors to represent similarities. Now we face a conundrum similar to what we encounter in Random Forest: Which vectors to choose to ensure chosen vectors are viable enough to be calculated, and how many digits to be put in a vector that's not to small to be inaccurate or too many to slow the whole process. We deal this problem using the same logic: We don't. Instead of choosing vectors by hand, we use neural network to do this for us. That's what Word2Vec essentially is: Setting up a fake problem, train it with neural network, and as a side effect, we get what we want in the first place, to convert words from natural language to digital data as vectors. The Word2Vec process itself is a Neural Network model. This technique to compute word embedding is called self-supervised learning.

[Figure 12] The structure of Word2Vec (Two types)

So in our case, we first extract PRE-box from sequences and number them from 1-38, considering both direction and consensus/non-consensus. And arrange them exactly according their position in the sequence, making a “sentence” with a vocabulary of 38 words, filling to same length with 0. Then put the processed sentence to word2Vec fake training model to get the vectors.

After we convert sequences to vectors, we can use the vectors to generate Weight Matrix as the embedding layer of RNN model.

Because in our 213 sequences, the most PRE-box number is 43, which means if we see gene sequence as a natural language sentence, the longest of them has only 43 words. It's considered a short sentence. Accordingly, we choose a fitting type of RNN model for this: The Bidirectional Gate Recurrent Unit model (Bi-GRU). It is more efficient in processing short sequences like ours. And because direction is a significant part of gene sequences, we use GRU bidirectionally, which means process the sequence backwards while still sees it as a normal sequence.

3.3.2 Model structure

[Figure 13] The structure of bidirectional GRU model

[Figure 13] is the structure of the Bi-GRU model, the corresponding relationship is as follow:

Where R_t represents the reset gate, if remove this part, the model will omit the history information, Z_t represents the update gate, which is the amalgamation of the input gate and forget gate of LSTM model. The update gate can control the history information's influence on the hide layers' output at the moment. If the value reach 1, then it will pass the history information to the next node.

3.3.3 Result and evaluation

[Figure 14] The mean Accuracy-Loss curve of Bi-GRU model in 5 folds

The [Figure 14] shows that at the initial stage the loss decayed sharply, whereas after trained for a few epochs, the value tends to remain the same, that is the training can't improve the model any more. The lack of data still is the most probable reason that caused this result.

[Figure 15] Mean ROC curve of Bi-GRU model in 5 folds.

The ROC plot shows that for the feasible class (0 and 2), the discrepancy is not very large, this could mean that GRU model is not very data-relative. Like CNN model, there still have three classes that have no value (nan).

RNN is mainly fucus on finding the sequential relationship, but not good at finding PRE-box features separately.

3.4 Convolution Neural Network—Bi-directional LSTM Model

Due to the context dependence of promoter sequence in structure, and the CNN model is suited for finding features with mathematic operation whereas RNN model has a strong ability to find the correlation of sequence variables and find keywords, so we try to combine the advantages of the two models, making a new mixed model to solve this problem, that is Convolution Neural Network (CNN) hybrid with Bi-directional Long-Short Term Memory model (Bi-LSTM). the CNN- Bi-LSTM model.

The reason why we chose LSTM model as the additional part of the whole model is after CNN process the data, the output data of CNN is actually much larger than the origin, and LSTM model is expert in processing long sequences.

3.4.1 Model structure

[Figure 16] The structure of a single LSTM unit

[Figure 16] is the structure of single LSTM module, The input of LSTM model is C_t-1, h_t-1, x_t.Output isC_t, h_t. The corresponding relationship is the following formula:

Where, x_t is the input data, h is the output value,h_t-1, is the previous output value, and f,o,i are three values between [0,1], respectively represent the probability that the information is discarded (forget gate), the probability that the information is output (output gate), and the probability that the information is updated (input gate). 0 indicates that all information is abandoned or output or updated, and 1 indicates that all information is retained or not output and not updated. Through the tanh function, we can know which parts need to be updated, and C'_t, represents the value to be updated (old cell state). And C_t-1 represents all the information transmitted into the LSTM model at the last time, so the output information is the result of equation (5), which is recorded as C_t (new cell state). From equation (6), we can get that the final output value h_t, ,the product of C_t, after tanh function processing and information output probability.

[Figure 17] Bidirectional LSTM model (the circle refers to a single LSTM unit)

3.4.2 Result and evaluation

[Figure 18] The Mean Accuracy-Loss curve of CNN-Bi-LSTM model in 5 folds

The Accuracy-Loss plot shows that the hybrid model is has the best mean test score among the models we tried, whereas it's also obvious that the train accuracy remains at a relatively low level, besides, the Loss value barely changed during training, we tried to switch the learning rate by using learning rate decay method, but the result still seems to be the same. Consequently, we asked our PI for help and the probable reason tend to be the lack of data. In the future after our wet lab members enrich the promoter activity data for PRE-box, we will try the new data set and further ameliorate the model we built.

[Figure 19] The mean ROC curve of CNN-Bi-LSTM model in 5-folds.

The ROC plot shows that the hybrid model is slightly more precise than CNN model as the total value is a bit higher, but it's apparent that this model needs more data as the value of class 1 is nan, just as the other classes that has little data.

3 Model

[Figure 20] The total ROC curve of all models.

What should be noted is that data form these pictures are selected to represent the model in general. Due to the nature of Neural Network, the results necessarily come with randomness. And in the process of selecting a better model. Bidirectional GRU shows instability more than others while the mixed model, CNN-Bidirectional LSTM stays most consistent. We assume this is because that the way we process data for Bi-GRU, extracting PRE-box as key words, shortens the data, thus caused instability. As a matter of fact, lack of data is the source of some of the biggest obstacles we encounter. Being limited to 213 data rows cause consistent overfitting and loop training. We attempted to use data argumentation, which means using data at hand as templates to generate more data. But considering the nature of gene sequences, we cannot predict the influence to intensity if we change PRE-box. For example, doubling the PRE box doesn't mean doubling of the intensity.

Another problem that's likely caused by the lack of data is that the effect of changing the model doesn't viably reflect on the results, as changing the model produces the same results. A bigger dataset is required for the project to move further.

Despite all that, this whole modeling process does give some interesting feedbacks. Firstly, Neural Network has proven to have a clear advantage over traditional machine learning especially at biology and gene genre, for it can provide insights for problems that still not fully understand by today's science.

A real example that happened during the modeling development is the Word2Vec method of Bidirectional GRU. In this process we need to convert PRE-box to a set of meaningful vectors. Due to little understanding of the mechanism of how the PRE-box affect promoter intensity, we can't do this manually. So, we decide to use Word2Vec, to use a Neural Network to generate vectors for us. The critical fact that needs to understand is: we don't know what each digit in these network-generated vectors stand for. That's the two sides: if we make vectors by hand, we can't ensure the accuracy. If we generate vectors with Neural Network, we don't know what each digit stands for. But that's the beauty of neural network: it can provide insights and guide experiments. Let's say after Word2Vec we see that the third in vectors has only two values, 0 and 1. So we can assume that the third digits stand for a binary condition, for example, has a certain PRE-box at a certain place, or not. And we can do experiments to prove it. What's often the case is that it's not the experiment itself is difficult, it's deciding what experiment to do that's difficult. And that is where we believe the significance of modeling in situations like this: to find patterns that are hard for humans to do it directly and then provide insights of the matter and guide the next step exploration.

Reference

1. Kotopka, B. J.; Smolke, C. D., Model-driven generation of artificial yeast promoters. Nat Commun 2020, 11 (1), 2113.

2. Sengupta, P.; Cochran, B. H., The PRE and PQ box are functionally distinct yeast pheromone response elements. Mol Cell Biol 1990, 10 (12), 6809-12.

3. Rozenwald, M. B.; Galitsyna, A. A.; Sapunov, G. V.; Khrameeva, E. E.; Gelfand, M. S., A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci 2020, 6, e307.

4. Zhang, Y. Q.; Qiao, S. J.; Ji, S. J.; Li, Y. Z., DeepSite: bidirectional LSTM and CNN models for predicting DNA-protein binding. Int J Mach Learn Cyb 2020, 11 (4), 841-851.

5. Oubounyt, M.; Louadi, Z.; Tayara, H.; Chong, K. T., DeePromoter: Robust Promoter Predictor Using Deep Learning. Front Genet 2019, 10.

6. Deng, Y. P.; Wang, L.; Jia, H.; Tong, X. Q.; Li, F., A Sequence-to-Sequence Deep Learning Architecture Based on Bidirectional GRU for Type Recognition and Time Location of Combined Power Quality Disturbance. Ieee T Ind Inform 2019, 15 (8), 4481-4493.