Fermentation model
Background
As designed in the experiment, the yeast carrying the paas gene produces
2-phenylethanol (2-PE) while growing at an optimal pH of 4.5~5.0 with sufficient water and
oxygen at a growth temperature of 30℃. The yeast continuously produces 2-PE
during growth, and the 2-PE concentration in the shake flask eventually reaches a
saturation value.
Therefore, we apply the growth model to the fermentation tank of industrial
production. The goal of the model is to simulate the growth of yeast in the fermentation-tank
environment, the consumption of substrates, and the change in the concentration of the product
2-PE, with the aim of exploring optimal strategies for regulating yeast fermentation, such as
choosing an appropriate feeding time and the best time to harvest the product. For
simplicity, we make the following main assumptions.
Assumptions
1. We ignore changes in other environmental factors after the start
of fermentation
Once fermentation starts, apart from important limiting factors such as the
concentration of the substrate, we assume that changes in oxygen concentration, pH
and other factors do not significantly affect yeast fermentation.
2. After the start of fermentation, the composition of the fermentation
broth is uniform
Because shake-flask fermentation is used, the entire fermentation system remains
well mixed after the start of fermentation, so there is no deviation caused by
local non-uniformity.
Parameter definitions
Mathematical model
(1)Yeast growth model
It is known from the literature that, among the models proposed by many scholars
to describe bacterial growth, the Monod equation and the Logistic equation are the most
widely used.
However, the Monod equation mainly describes the growth of colonies
under conditions of slow cell growth, low density and single-substrate limitation, which are
too idealized. Judging from the experimental conditions and the fermentation curve of the
yeast fermentation process, the influence of yeast cell biomass and of the product 2-PE
concentration on cell growth cannot be ignored, so the Monod equation is no longer applicable.
In this experiment we adopt the Logistic equation to simulate the growth of yeast.
The Logistic equation is a typical S-shaped curve, widely used to describe the
growth of microorganisms, and it performs well: it fits the lag phase,
logarithmic growth phase and stationary phase of the yeast growth curve.
The specific expression of the equation is:
dC_x/dt = μmax · C_x · (1 − C_x/Cmax)
Substituting the initial biomass of yeast at the beginning of the
fermentation, C_x = C_0 at t = 0, into the analytical solution of the equation gives:
C_x = C_0 · Cmax · e^(μmax·t) / (Cmax − C_0 + C_0 · e^(μmax·t))
Taking the logarithm of both sides of the equation yields:
ln[C_x/(Cmax − C_x)] = μmax·t + ln[C_0/(Cmax − C_0)]
The parameter values μmax and Cmax are obtained by plotting.
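This plotting step can be sketched in Python; the time points, Cmax and the underlying μmax below are synthetic assumptions for illustration, not measured values:

```python
import numpy as np

# Hypothetical shake-flask biomass readings; Cmax assumed known from the plateau.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Cmax = 10.0
Cx = Cmax / (1.0 + 9.0 * np.exp(-0.5 * t))   # synthetic Logistic data, mu_max = 0.5

# Linearized form: ln(Cx / (Cmax - Cx)) = mu_max * t + ln(C0 / (Cmax - C0))
y = np.log(Cx / (Cmax - Cx))
mu_max, intercept = np.polyfit(t, y, 1)      # slope of the fitted line is mu_max
```

On real shake-flask data, Cmax is first read off the biomass plateau, and the fitted slope then gives μmax.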
(2) Kinetic model of product growth
According to the characteristics of the fermentation product, the
relationship between product formation and the growth of the bacteria falls mainly into the
following three categories:
① Product formation is not coupled with the growth of the bacteria: the
product is formed mainly after growth has ceased, in the stationary phase;
② Product formation is coupled with the growth of the bacteria:
the product is only produced while the bacteria grow;
③ Product formation is partially coupled with the growth of the bacteria:
part of the product is formed during the growth of the bacteria, and most of the
product is formed during the stationary phase.
For the product growth model, we use the Leudeking-Piret equation to describe
the production law of the product 2-PE:
dP/dt = α · dC_x/dt + β · C_x
(1) When α ≠ 0, β = 0: growth-coupled type
(2) When α ≠ 0, β ≠ 0: partially growth-coupled type
(3) When α = 0, β ≠ 0: non-growth-coupled type
In the steady state, the yeast biomass hardly changes with time, that
is, C_x = Cmax and dC_x/dt = 0; substituting into the equation gives dP/dt = β·Cmax, from
which β can be estimated. At the same time, when the fermentation has just started, the
concentration of the product 2-PE is approximately 0; substituting t = 0, P = 0 and
integrating, we get:
P = α·A(t) + β·B(t)
where A(t) = C_x − C_0 and B(t) = ∫₀ᵗ C_x dt. By plotting [P − β·B(t)] versus A(t), the
slope of the straight line obtained is α.
We determine the values of α and β by fitting the experimental data, thereby determining the
production model of 2-PE.
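The two-step fit, β from the stationary phase and then α as the slope of [P − β·B(t)] against A(t), can be sketched as follows; all parameter values and the synthetic data are illustrative assumptions, not fitted experimental values:

```python
import numpy as np

# Synthetic data under assumed Logistic growth (mu_max = 0.5, C0 = 1, Cmax = 10).
t = np.linspace(0.0, 20.0, 201)
mu, C0, Cmax = 0.5, 1.0, 10.0
Cx = Cmax / (1.0 + (Cmax / C0 - 1.0) * np.exp(-mu * t))

alpha_true, beta_true = 2.0, 0.1
# B(t) = integral of Cx dt, by cumulative trapezoid; P from dP/dt = a*dCx/dt + b*Cx
B = np.concatenate(([0.0], np.cumsum(0.5 * (Cx[1:] + Cx[:-1]) * np.diff(t))))
P = alpha_true * (Cx - C0) + beta_true * B

# beta from the stationary phase, where dP/dt is approximately beta * Cmax
beta = (P[-1] - P[-11]) / (t[-1] - t[-11]) / Cmax
# alpha as the slope of [P - beta*B(t)] versus A(t) = Cx - C0
alpha = np.polyfit(Cx - C0, P - beta * B, 1)[0]
```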
(3) Kinetic model of substrate consumption
In the fermentation and synthesis of 2-PE, the consumption of the
carbon source mainly goes to:
① building cell components for growth;
② maintaining the growth of the yeast and its metabolic needs;
③ synthesizing primary and secondary metabolites such as 2-PE.
This model adopts a Leudeking-Piret-like equation to simulate the consumption
process of the substrate. The mathematical relationship is:
−dS/dt = γ · dC_x/dt + δ · C_x
Integrating, we can get:
S_0 − S = γ·M(t) + δ·N(t)
where M(t) = C_x − C_0 and N(t) = ∫₀ᵗ C_x dt. Plotting [S_0 − S − δ·N(t)] versus M(t), the
slope of the line obtained is γ.
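Taken together, the growth, product and substrate equations form a small ODE system; a minimal forward-Euler sketch with illustrative parameter values (not the paper's fitted values):

```python
import numpy as np

# Forward-Euler integration of the coupled model: Logistic growth,
# Leudeking-Piret product formation, Leudeking-Piret-like substrate use.
# All parameter values are illustrative, not fitted values from the paper.
mu, Cmax = 0.5, 10.0
alpha, beta = 2.0, 0.1    # product formation terms
gamma, delta = 1.5, 0.05  # substrate yield / maintenance terms

dt, steps = 0.01, 2000    # integrate to t = 20
Cx, P, S = 1.0, 0.0, 50.0
for _ in range(steps):
    dCx = mu * Cx * (1.0 - Cx / Cmax)
    dP = alpha * dCx + beta * Cx
    dS = -(gamma * dCx + delta * Cx)
    Cx, P, S = Cx + dCx * dt, P + dP * dt, S + dS * dt
```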
Results and Analysis
Shake flask fermentation experiment data
Figure 1 shows the curve of yeast concentration, phenethyl alcohol production,
and substrate concentration over time during yeast fermentation.
Figure 1
It can be seen from Figure 1 that after inoculation, thanks to
the rich nutrients in the medium in the early fermentation stage, the yeast grows rapidly and
metabolizes vigorously, quickly entering the logarithmic growth phase. In the late stage of
yeast growth, the exhaustion of nutrients in the medium and the accumulation of harmful
metabolites steadily worsen the yeast's living environment, causing yeast death and a decline
in biomass. The growth of yeast thus basically conforms to the S-shaped growth curve, the
biomass reaches its maximum in about 6 days, and the increase in biomass inhibits further
growth, which falls within the applicable range of the Logistic equation; it is therefore
appropriate to use it to describe the growth law of yeast during fermentation.
At the same time, little 2-PE is synthesized in the early stage of yeast
fermentation, while in the logarithmic growth phase and saturation period of the yeast the
synthesis of 2-PE increases steadily, indicating that the relationship between 2-PE synthesis
and the growth of the yeast belongs to the partially growth-coupled type.
Yeast growth model
Plotting ln[C_x/(Cmax − C_x)] versus t according to the kinetic model equation,
the linear relationship between the two can be obtained, with correlation coefficient
R² = 0.9611, from which μmax = 0.1449 is calculated.
Substituting the parameters into the equation gives the kinetic model of yeast
growth.
We performed nonlinear fitting of the model parameters to the data, as shown in Figure 3.
It can be seen from Figure 3 that in the early and middle stages of yeast growth the
predicted values fit the experimental values closely, but the deviation is larger in the late
stage of growth. This is because the substrate is consumed and is no longer sufficient
to maintain the basic life activities of the yeast, resulting in a significant decrease in
biomass in the later stage.
Figure 3
Product generation model
Through data fitting, we obtain the linear relationship between [P − β·B(t)]
and A(t), with R² = 0.9899.
Substituting the parameters into the equation gives the kinetic model of 2-PE production.
As shown in Figure 5, the shake-flask fermentation curve of 2-PE shows that in
the initial stage of yeast growth, yeast growth is proportional to substrate consumption and
little 2-PE is produced, while in the logarithmic phase of yeast growth the synthesis of 2-PE
is directly proportional to substrate consumption. This indicates that the synthesis of 2-PE
and the growth of yeast have a partially growth-coupled relationship.
Figure 5
Substrate consumption model
By plotting [S_0 − S − δ·N(t)] versus M(t), we obtain the linear relationship
between the two, with R² = 0.9511.
Substituting the parameters into the equation gives the kinetic model of substrate
consumption.
Figure 7 shows the fitted curve of substrate consumption during the shake-flask
fermentation process. We found that the Leudeking-Piret-like model fits the data well.
Substrate consumption is slow in the early stage mainly because the yeast has not yet
entered the logarithmic growth phase after inoculation and, at the same time, little 2-PE is
being produced; in the late fermentation period, substrate consumption levels off.
Figure 7
Conclusion
Experiments have proved that the yeast introduced with the paas gene
has strong
biological activity and can be used as a dominant strain for the production of 2-PE.
Based on the understanding of the metabolic characteristics of yeast production
of 2-PE, yeast growth and substrate consumption, etc., a fermentation
kinetic model for the synthesis of 2-PE in shake flasks was constructed. The predicted value and
actual results can be well matched. This provides a further basis for the fermentation
production control of 2-PE.
References
1. Zhou, Q.; Liu, Y.; Yuan, W., Kinetic modeling of butyric acid effects on
butanol fermentation by Clostridium saccharoperbutylacetonicum. N Biotechnol 2020, 55, 118-126.
2. Wang, T.; Lu, Y.; Yan, H.; Li, X.; Wang, X.; Shan, Y.; Yi, Y.; Liu, B.; Zhou,
Y.; Lu, X., Fermentation optimization and kinetic model for high cell density culture of a
probiotic microorganism: Lactobacillus rhamnosus LS-8. Bioprocess Biosyst Eng 2020, 43 (3),
515-528.
Expansion model
Our project's second product can be sprayed directly onto land, where it
produces 2-PE and gives off a fragrant smell. We tested it in an experimental field and
established a second model: the yeast expansion model.
The proliferation of microorganisms in the environment has always been an
interesting topic. However, due to the complexity of the system, it is very difficult to study
their precise motion. We therefore needed a model to simulate the growth and
spread of yeast on the ground.
The model is divided into two parts, a growth model and a diffusion model,
and the increase in biomass in each small area at time t + Δt is determined by two parts:
diffusion from the surrounding areas and the self-growth of the yeast in the area at time t,
that is:
u(t + Δt) = u(t) + u(G.N) + u(E.N)
where u represents the biomass of yeast in a unit area, and u(G.N) and u(E.N) represent the
growth and diffusion of yeast in time Δt, respectively.
Model assumptions
In order to ensure the rationality of the model, we have made the following
assumptions:
Yeast growth stage
The main forces that drive the movement of the microorganisms are:
1. The concentration difference of the substrate: because nutrient
concentrations differ between areas, yeast spontaneously moves from areas with lower nutrient
concentration toward areas with higher nutrient concentration. At the same time, in this
model we assume that the nutrient concentration is initially equal in every area of the soil.
2. Cells "sliding" or "pushing": when cells grow rapidly they tend to occupy
the entire space; once the concentration reaches a certain level they push against each other
and the colonies begin to expand outward.
Factors we do not consider are:
Changes in yeast concentration caused by various factors in the natural
environment, for example large-scale yeast death when the temperature is too high or too low,
or the loss of large amounts of nutrients through running water.
Late stage of yeast expansion
We considered two simple situations: nutritionally adequate and nutritionally
deficient. In the nutrient-sufficient case the yeast does not die, because the soil provides
enough nutrients, and it keeps growing until the concentration reaches its maximum; in the
nutrient-deficient case the soil provides limited nutrients, so as the yeast grows, nutrients
are gradually consumed and metabolites gradually accumulate and inhibit growth, and yeast
death therefore occurs in the late stage in areas with higher concentrations.
Parameter definition
Diffusion model:
Calculate the gradient
We use Fick's first law, a classic equation describing the diffusion of
particles in a solution. Assuming that the diffusion flux J runs from the high
solute-concentration area to the low-concentration area, with magnitude proportional to the
concentration gradient, we can write:
J = −D · du/dx
In two-dimensional space, it is:
J = −D · ∇u
where D is the diffusion coefficient, an inherent property of the solute and the medium; a
large value of D indicates a higher diffusion rate. J is a two-dimensional vector that can be
decomposed along several directions, so we can calculate the diffusion rate in any direction.
Derivation of the diffusion equation:
We consider a micro-element ΔS of the system at position (x, y),
then the amount of yeast that diffuses into the area element ΔS in time
ΔT is:
On the other hand, the amount of yeast flowing out of the area element
ΔS in the modified time is:
At the same time, the yeast biomass of the area element changes with the
progress of the diffusion process. The specific concentration of this area is defined as the
biomass required for the area element per unit area to change by one unit concentration, and the
mathematical expression is as follows:
From the law of conservation of matter:
in two dimensional space, it is
Substitute Q1, Q2,
Q3 to get:
Where:
Use the finite difference method to solve the diffusion equation:
Suppose the growth area of the yeast is the rectangular area of l×S, and the
solution area is discretized into N×M units; u(x, y, t) is therefore rewritten as u_(i,j,k).
At each node, the equation simplifies to the difference scheme:
u_(i,j,k+1) = u_(i,j,k) + (D·Δt/h²) · [u_(i+1,j,k) + u_(i−1,j,k) + u_(i,j+1,k) + u_(i,j−1,k) − 4·u_(i,j,k)]
where h is the grid spacing and Δt is the time step.
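A minimal sketch of this explicit scheme; the grid size, D, Δt and h are illustrative, and Δt must satisfy the usual stability condition D·Δt/h² ≤ 1/4:

```python
import numpy as np

# One explicit finite-difference step of the 2D diffusion equation:
# u[i,j] += D*dt/h^2 * (sum of the 4 neighbours - 4*u[i,j]).
# Boundary cells are left fixed; D, dt, h and the grid are illustrative.
def diffuse_step(u, D=1.0, dt=0.01, h=0.5):
    r = D * dt / h**2          # must stay <= 0.25 for stability
    un = u.copy()
    un[1:-1, 1:-1] += r * (u[2:, 1:-1] + u[:-2, 1:-1]
                           + u[1:-1, 2:] + u[1:-1, :-2]
                           - 4.0 * u[1:-1, 1:-1])
    return un

u = np.zeros((21, 21))
u[10, 10] = 1.0                # point inoculum in the centre
for _ in range(100):
    u = diffuse_step(u)        # mass spreads out; total stays ~1 far from edges
```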
Yeast growth model:
Yeast has no motility structures such as flagella, so a diffusion model alone is
far from sufficient for the expansion process. While diffusing toward lower concentrations
along the concentration difference, the yeast also keeps growing, increasing the biomass per
unit area, so we introduce a yeast growth model.
As in the previous model, we use the Logistic equation to simulate the growth of
yeast. The equation is:
du/dt = μ·u·(1 − u/u_max)
After discretization, we get:
u_(i,j,k+1) = u_(i,j,k) + Δt·μ·u_(i,j,k)·(1 − u_(i,j,k)/u_max)
Simulation results
Finally, we get the difference form of the yeast expansion equation as:
u_(i,j,k+1) = u_(i,j,k) + (D·Δt/h²) · [u_(i+1,j,k) + u_(i−1,j,k) + u_(i,j+1,k) + u_(i,j−1,k) − 4·u_(i,j,k)] + Δt·μ·u_(i,j,k)·(1 − u_(i,j,k)/u_max)
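The combined update can be sketched as follows; the parameter values and the circular initial inoculum are illustrative assumptions:

```python
import numpy as np

# One explicit update per step of the combined expansion equation:
# diffusion (5-point Laplacian) plus discrete Logistic growth.
# D, mu, umax, dt and h are illustrative values, not the paper's parameters.
D, mu, umax, dt, h = 0.1, 0.5, 1.0, 0.05, 0.5
r = D * dt / h**2              # must stay <= 0.25 for stability

u = np.zeros((41, 41))
yy, xx = np.mgrid[0:41, 0:41]
u[(yy - 20)**2 + (xx - 20)**2 <= 16] = 0.2   # circular inoculum in the centre

for _ in range(400):
    lap = np.zeros_like(u)
    lap[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:]
                       + u[1:-1, :-2] - 4.0 * u[1:-1, 1:-1])
    u = u + r * lap + dt * mu * u * (1.0 - u / umax)
```

The colony centre saturates at u_max while the front advances outward, reproducing the ring-like expansion described below it in qualitative terms.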
We show the simulation results through MATLAB:
The picture shows the ideal situation: the soil has sufficient nutrients and the
yeast has a high growth rate. In the initial stage we assume that a certain biomass of yeast
is sprayed over a circular area, so the initial colony is a circle. At first the yeast begins
to expand but has not yet reached the maximum biomass, and we can see a clear gradation (from
blue to yellow to red) in the initial stage of expansion. Due to fast growth, the grid cells
in the centre soon reach their maximum carrying capacity, showing dark red; the cells on the
edge of the circle initially hold little yeast biomass and appear dark blue. With sufficient
nutrients, as the yeast continues to spread and grow, the edge area is eventually covered
with yeast and also turns dark red.
The picture shows the nutrient-deficient situation. As in the ideal case, the
initial colony is a circular area. As the yeast expands, the biomass in the central area
gradually increases and spreads to the surroundings. But when growth in the central area
reaches its maximum, the consumption of nutrients and the accumulation of metabolites cause
the yeast in the higher-biomass centre to begin dying. We can therefore clearly see that the
colour of the central area changes from blue to red and finally turns back to blue.
References
1. Papacek, S.; Matonoha, C.; Stumbauer, V.; Stys, D., Modelling and simulation
of photosynthetic microorganism growth: random walk vs. finite difference method. Math Comput
Simulat 2012, 82 (10), 2022-2032.
2. Bussemaker, H. J.; Deutsch, A.; Geigant, E., Mean-field analysis of a
dynamical phase transition in a cellular automaton model for collective motion. Phys Rev Lett
1997, 78 (26), 5018-5021.
Promoter model
Background
Machine learning methods have been widely used in many fields of biological
research, such as protein structure and stability prediction, RNA secondary structure
prediction, and promoter recognition and structure analysis. At present there is no database
of pheromone-responsive promoters, and the lack of systematic characterization data greatly
limits the application of pheromone-responsive promoters in gene circuits.
Abstract
Building on the model developed by last year's iGEM team, which used MATLAB to
predict the relationship between the pheromone response element (PRE-box) motif in promoter
sequences and promoter activity, this year we selected pheromone-inducible promoters,
constructed a library, and used the characterization data in the database to train machine
learning models, attempting to find sequence-based rules for the activity of
pheromone-inducible promoters. The models include a Random Forest (RF) model, a Convolutional
Neural Network (CNN) model, and a Bidirectional Gated Recurrent Unit (Bi-GRU) model. In
addition, we made a further improvement by building a hybrid Bi-LSTM-CNN model, and finally
selected the best-performing model among all those we built using accuracy, loss curves and
ROC curves.
1 Technical route
Our primary goal is to find the biological rule for how the PRE-box motif
affects promoter activity. We used K-means to stratify our data, then applied three main data
pre-processing methods to convert the data into suitable forms for four models: a Random
Forest model, a Convolutional Neural Network (CNN) model, a Bidirectional Gated Recurrent
Unit (Bi-GRU) model, and a hybrid Bidirectional Long Short-Term Memory-Convolutional Neural
Network (Bi-LSTM-CNN) model. We chose this many models because each has its own strengths,
and since no predecessors have explored this field, we tried more possible solutions to
increase our chance of solving the problem.
We then trained our models with 5-fold cross-validation and evaluated their
performance using accuracy-loss curves and ROC curves. In the future, we will feed new, more
ample data to our best model to obtain more reliable results.
[Figure 1] The technical route we used.
2 Data pre-processing
As we attempt to find the specific relationship between the PRE-box and
promoter activity, we selected 213 promoter sequences available for expression from the
roughly 7,000 raw sequences provided by the wet lab. The raw data set contains a lot of data
that is useless for our project; for example, some sequences contain a PRE-box motif, yet
that part of the promoter sequence is not expressed. Such sequences must be screened out by
experiment, otherwise they would reduce the data's accuracy.
Once we obtained our final data set (213 sequences), the first thing we worried
about was the lack of data: whether for machine learning or deep learning, a model needs
large amounts of data, since plentiful training is required to produce convincing
predictions.
5-fold cross validation
We used 5-fold cross-validation in order to make full use of our data. We first
randomly divided the data into 5 sets, chose one as the test set, and used the rest as the
training set. Once that fold was done, the second set became the test set and the others the
training set, and so on. In this way each set serves as the test set once and as part of the
training set four times, so every data point is used five times.
[Figure 2] The concept of 5-fold Cross Validation we used.
Because our data is insufficient, which set becomes the test set matters a
great deal, so we first stratified the data to make sure that each set contains all the
categories we defined. The classifier we want divides promoter sequences, by PRE-box, into 5
categories: very weak, weak, medium, strong, and very strong. In the future, when we have
sufficient data, we may increase the number of categories and fit a regression curve.
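The stratified split can be sketched without library support; the labels below are random placeholders for the 5 intensity classes of the 213 sequences:

```python
import numpy as np

# Placeholder labels standing in for the 5 promoter-intensity classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=213)

k = 5
folds = [[] for _ in range(k)]
for cls in np.unique(y):                 # deal each class round-robin over the folds
    for i, j in enumerate(rng.permutation(np.where(y == cls)[0])):
        folds[i % k].append(int(j))

# each fold serves once as the test set; the other four form the training set
splits = [(sorted(sum((folds[g] for g in range(k) if g != f), [])), sorted(folds[f]))
          for f in range(k)]
```

Because each class is dealt across all folds, every test fold contains every category, which is the property the stratification is meant to guarantee.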
K-means
Promoter intensities vary greatly, and although the values are mathematically
related, the computer treats them as independent classes, so separating the intensity data
into classes is a required pre-processing step. Because the intensity data is not uniformly
distributed, we cannot use an even classification; to classify the uneven data in a
statistically proper way, we used K-means clustering.
[Figure 3] The algorithm flow chart of K-means
Although K-means clustering is commonly used for 2D data, it is also suitable
for 1D data like ours. After tuning the cluster number, we chose to cluster into 9 classes
first and then merge the 5 small classes into a new one: with 9 clusters, the large classes,
where the data is more aggregated, are separated well, while each small class contains only
one or two relatively large values. We merged these small high-value classes into one new
class representing "very strong" intensity.
Finally, we divided the intensity data into 5 classes, representing "very weak",
"weak", "medium", "strong" and "very strong".
ROC
The method we chose to evaluate our models is the Receiver Operating
Characteristic (ROC) curve. It is one of the most commonly used indicators in the machine
learning community and is well suited to evaluating classification models. A classification
model (classifier) categorizes data into different sets (classes). For our model, we extended
the standard binary-classification ROC curve to a multi-class ROC curve.
First, we should understand the basic theory of the binary-classification ROC
curve. When we train a classifier on a dataset, the predicted result may or may not equal the
actual result, and the actual result is either positive or negative, so four situations can
occur: false positive (FP), false negative (FN), true positive (TP) and true negative (TN).
The most commonly used evaluation indicators for the two-class problem are precision and
recall. The precision rate is defined as:
Precision = TP / (TP + FP)
The recall rate is defined as:
Recall = TP / (TP + FN)
Moreover, F is the weighted harmonic average (WHA) of the precision and recall:
F = (1 + a²) · Precision · Recall / (a² · Precision + Recall)
When a = 1, the F1 score is the harmonic mean of the precision rate and the recall rate:
F1 = 2 · Precision · Recall / (Precision + Recall)
The higher the F/F1 score, the more accurate the classifier.
In statistics, there are two ways to average the F/F1 score over multiple
classes, 'macro' and 'micro'. As our dataset is imbalanced, which would affect the final
result, we chose 'micro' averaging as it is more suitable for evaluating our model.
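Micro averaging pools the TP/FP/FN counts over all classes before computing the metrics; a small sketch with toy labels (note that for single-label multi-class problems, micro precision and recall both equal the accuracy):

```python
import numpy as np

# Toy multi-class labels (5 classes) and predictions.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 3, 4, 2])

# Pool TP/FP/FN over all classes ("micro" averaging).
classes = np.unique(y_true)
tp = sum(np.sum((y_pred == c) & (y_true == c)) for c in classes)
fp = sum(np.sum((y_pred == c) & (y_true != c)) for c in classes)
fn = sum(np.sum((y_pred != c) & (y_true == c)) for c in classes)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```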
3 Model
3.1 Random Forest Model
Random Forest model offers outstanding method for working with missing data and
thousands of variables, some reference illustrated that the random forest model shows great
effect on classifying data.
3.1.1 Characteristics extract
Based on the data we obtained, we used a Python script to count the
characteristics we considered worth analyzing: number of + consensus PREs, number of
− consensus PREs, total consensus PREs, number of + non-consensus PREs, number of
− non-consensus PREs, total non-consensus PREs, total PREs, and promoter length (bp). We then
collected these characteristic data and saved them in an Excel file.
[Table 1] The characteristics and information we collected using
Python
3.1.2 Model structure
[Figure 4] The structure of Random Forest model.
[Figure 4] is the structure plot we made to explain how our Random Forest model
works; the model construction process can be described in more detail as:
(1) n new training sets are sampled from the original data by bootstrap
(bagging) sampling.
(2) For each new training set, a corresponding regression-tree model is
built and its predicted value is output.
(3) After the n regression-tree models are built, the final predicted value of
the model is the average of all the tree outputs.
In addition to relatively high precision and speed, the Random Forest model can
also evaluate the relative importance of the input parameters to the target value from the
node-splitting results, a characteristic often used in parameter sensitivity analysis. When a
node is split, the split-evaluation criterion (the impurity) declines; for a specific input
variable, the total decline over all regression trees, normalized by the total number of
trees, evaluates the relative importance of the parameter. In mathematical language:
I_jf = C_j − (M_l/M_j) · C_l(j) − (M_r/M_j) · C_r(j)
where I_jf is the importance of node j for feature f in a decision tree; C_j is the impurity
evaluation index of node j; M_l and M_r are the sample numbers of the left and right subsets
of node j; M_j is the total sample number at node j; and C_l(j) and C_r(j) are the impurity
evaluation indices of the left and right subsets.
The relative importance of feature f can be expressed as the sum of I_jf over
all nodes j that split on f, across all trees, divided by n, where n is the total number of
regression trees and j ranges over the nodes of each decision tree.
3.1.3 Result and evaluation
[Figure 5] The mean ROC curve of random forest in 5 folds.
The ROC curve shows that except for class 0, the results for the other classes
lack reliability, as their AUC is less than 0.5. Moreover, the overall random forest AUC is
only 0.38, which means the model is unreliable for most classes.
[Table 2] The mean accuracy of random forest model after 5-fold of
each characteristic.
The Random Forest predictions are shown in [Table 2]. The accuracy is relatively
low; even the best feature reaches only 0.3. Why would this happen?
The most probable reason, we concluded, is a problem with the selection of
features. When selecting features, we could not tell which features have a viable effect on
promoter activity, nor how many features are enough: too few features cause inaccuracy,
while too many cause redundancy. Feature selection is important but hard to deal with, so we
tried the next model to sidestep this impediment.
3.2 Convolution Neural Network (CNN) Model
To solve the feature-selection problem, we used neural network models, which
are capable of selecting features themselves rather than by human assumption; the questions
mentioned above can thus be avoided.
Among neural network models, the Convolutional Neural Network (CNN) performs
best at feature extraction, so we applied it to our project first.
3.2.1 Data pre-processing
However, a big problem needed to be solved before building the CNN model: the
sequence data we collected are not uniform (their lengths are inconsistent), whereas a neural
network needs normalized input. We therefore found the longest sequence and padded the other
sequences with meaningless 0s until their lengths were equal, then used one-hot encoding to
convert each promoter sequence into a 4×N array [Figure 6].
[Figure 6] Using the one-hot encoding method to convert bases into the digits 0
and 1. N represents the number of bases in the longest sequence, and the green area
represents the replenished irrelevant digits.
[Figure 7] The deeper the colour of the green area, the more times the data
is convolved.
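The padding and one-hot encoding step can be sketched as follows (the sequences below are toy examples):

```python
import numpy as np

# One-hot encode promoter sequences into 4xN arrays, padding shorter
# sequences with all-zero columns up to the length of the longest one.
BASES = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(seqs):
    n = max(len(s) for s in seqs)
    out = np.zeros((len(seqs), 4, n))
    for i, s in enumerate(seqs):
        for j, b in enumerate(s):
            out[i, BASES[b], j] = 1.0   # padded positions stay all zero
    return out

X = one_hot(["ACGT", "ACGTACG"])        # shape (2, 4, 7)
```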
However, a CNN kernel slides over the data one step at a time, and since our
kernel slides vertically over the data, the convolved output is one-dimensional: only
positions more than one kernel length from either end are fully convolved, while the data at
the front and the end are convolved fewer times than the rest [Figure 7], which would bias
the result. We therefore constructed convolution shells, sets of non-relevant digits, at the
fringes of the sequence data to prevent this situation. [Figure 8]
[Figure 8] Valid convolution processed data.
After data processing, our CNN model is constructed from the following stages: a
CNN stage, a dense stage and an activation stage.
3.2.2 Model structure
CNN stage
In the convolution stage, multiple kernels K slide over the
embedded array, where K ∈ R^(n×k) and k is the fixed length set in the embedding stage:
x_j^l = f(x^(l−1) * K_j^l + b_j^l)
where x_j^l is feature number j on layer l; K_j is convolution kernel number j; * represents
the convolution operation; and b_j^l represents the bias of feature number j on layer l.
The activation function f we used is ReLU:
ReLU(x) = max(0, x)
Dense stage (fully connected stage)
The dense stage integrates the feature values provided by the previous layer,
then judges whether they are valid. It turns the fixed-length vectors into vectors of a
specific length n; we chose n = 1, so the vector becomes a real number.
Active stage
At this stage we map the real number produced by the last layer into the
interval [0, 1]. As we were building a multi-class classifier, the activation function we
chose was SoftMax, which can be expressed as:
σ(z)_i = e^(z_i) / Σ_j e^(z_j)
where σ(z)_i is the value after SoftMax standardization and z_i is the value to be
transformed. This function standardizes the input data and scales the values to the same
order of magnitude, between zero and one.
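A minimal SoftMax sketch; the max-subtraction trick is a standard numerical-stability detail, not something discussed above:

```python
import numpy as np

# SoftMax as defined above: sigma(z)_i = exp(z_i) / sum_j exp(z_j).
def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max to avoid overflow
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))   # sums to 1, preserves the ordering of z
```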
In general, the structure of the CNN model is shown in [Figure 9].
[Figure 9] The structure of CNN model.
3.2.3 Results and evaluation
[Figure 10] Mean Accuracy-Loss curve of CNN in 5-fold
The accuracy-loss curve shows that the training accuracy is very high (near
1), whereas the mean test accuracy over the 5 folds is only 0.4883: our model learned
well but does not fit the test data. Overfitting is the most probable reason for the
biased test accuracy, but after tuning the parameters that might help, the result remained
the same, so further improvement will likely require more data from the wet lab.
[Figure 11] The mean ROC curve of CNN in 5 folds.
The ROC result of the CNN is clearly better than that of the Random Forest
model, but for classes 3, 4 and 5 the AUC (area under the curve) is zero, most likely because
of the lack of data: classes 0 and 1 together account for almost two-thirds of all the data.
While the CNN results are better than the Random Forest's, we think there is
still room for improvement on our dataset. After discussion, we concluded that a key factor
the CNN model lacks is the ability to find the sequential relationship between PRE-boxes. By
transforming a sequence into a 4×N array, the convolution kernel is good at translating
sequence data into digital data the machine can understand, and at finding the features of
each PRE-box; but the sequential relationship between PRE-boxes, which experiments have shown
affects promoter intensity, is omitted by the CNN. Consequently, we needed a new neural
network model for the sequential relationship of PRE-boxes.
3.3 Recurrent Neural Network: Bidirectional Gated Recurrent Unit
RNN models are better at capturing sequential characteristics than CNNs. There
are many RNN-based models, but before choosing one, the first task is to convert the
data.
3.3.1 Word2Vec
To find the sequential relationship, we need a data-processing method that
translates a gene sequence into digital data; this procedure is called word embedding. One
commonly used method is one-hot encoding, but it has several drawbacks, mainly that the
processed data are independent of each other, with no numeric relationship, and that it
produces a lot of redundant data, which slows calculation.
We chose the Word2Vec method instead. It is one of the most popular ways to
process natural language and has proven successful on a variety of downstream
natural-language-processing tasks, and gene sequences are similar to sentences in an
algorithmic way.
Word2Vec turns words into sets of digits called vectors, and uses the digits in
the vectors to represent similarities. Here we face a conundrum similar to the one we
encountered with Random Forest: which vectors to choose so that they are viable for
calculation, and how many digits to put in a vector, not so few as to be inaccurate nor so
many as to slow the whole process. We deal with this problem using the same logic: we don't.
Instead of choosing vectors by hand, we use a neural network to do it for us. That is
essentially what Word2Vec is: set up a "fake" problem, train a neural network on it, and
obtain as a side effect what we wanted in the first place, a conversion of words from natural
language into digital vectors. The Word2Vec process is itself a neural network model, and
this technique for computing word embeddings is called self-supervised learning.
[Figure 12] The structure of Word2Vec (Two types)
So in our case, we first extract the PRE-boxes from each sequence and number them from 1 to 38, accounting for both orientation and consensus/non-consensus status. We then arrange them exactly according to their positions in the sequence, forming a “sentence” over a vocabulary of 38 words, and pad all sentences to the same length with 0. Finally, we feed the processed sentences into the Word2Vec training model to obtain the vectors.
After converting the sequences to vectors, we use the vectors to generate the weight matrix that serves as the embedding layer of the RNN model.
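The preprocessing described above can be sketched as follows (the token ids, sentence lengths, and vector dictionary here are hypothetical; in practice the per-token vectors would come from the trained Word2Vec model):

```python
import numpy as np

def pad_sentences(sentences, pad_id=0):
    """Right-pad lists of PRE-box token ids (1-38) with 0 to a common length."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_id] * (max_len - len(s)) for s in sentences]

def build_embedding_matrix(token_vectors, vocab_size, dim):
    """Stack the per-token Word2Vec vectors into the weight matrix used to
    initialise the RNN embedding layer; row 0 stays zero for the padding id."""
    W = np.zeros((vocab_size + 1, dim))
    for tok, vec in token_vectors.items():
        W[tok] = vec
    return W
```

Looking up a padded sentence's token ids as rows of this matrix then yields the dense input sequence the RNN consumes.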
Among our 213 sequences, the largest PRE-box count is 43, which means that if we view a gene sequence as a natural-language sentence, the longest of them has only 43 words; this counts as a short sentence. Accordingly, we chose a fitting type of RNN model: the Bidirectional Gated Recurrent Unit (Bi-GRU), which is more efficient at processing short sequences like ours. And because direction is a significant property of gene sequences, we use the GRU bidirectionally, meaning the model also processes each sequence backwards while still treating it as an ordinary sequence.
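The bidirectional idea can be sketched independently of the gate details: run the same recurrent step forward and backward over the sequence and concatenate the two resulting states (the toy step function in the usage example below is hypothetical, standing in for a GRU cell):

```python
import numpy as np

def bidirectional(step, xs, h0):
    """Apply a recurrent step(x, h) -> h over xs in both directions and
    concatenate the final forward and backward states, as Bi-GRU does."""
    h_fwd = h0
    for x in xs:                 # left-to-right pass
        h_fwd = step(x, h_fwd)
    h_bwd = h0
    for x in reversed(xs):       # right-to-left pass
        h_bwd = step(x, h_bwd)
    return np.concatenate([h_fwd, h_bwd])
```

Because the step is order-sensitive, the two halves of the output differ, which is exactly how the model exposes directional information to the layers above it.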
3.3.2 Model structure
[Figure 13] The structure of bidirectional GRU model
[Figure 13] shows the structure of the Bi-GRU model; the corresponding relationships are as follows:
Here Rt represents the reset gate: if this part is removed, the model ignores the history information. Zt represents the update gate, which merges the input gate and forget gate of the LSTM model. The update gate controls how strongly the history information influences the hidden layer's output at the current moment; if its value reaches 1, the history information is passed on to the next node.
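The gate relations described above can be written as a minimal numpy sketch of one GRU step (weight shapes are hypothetical and biases are omitted for brevity), following the convention in the text where Zt = 1 passes the history state through:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU step: Rt (reset gate) limits how much history enters the
    candidate state; Zt (update gate) blends history with the candidate."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)             # update gate Zt
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)             # reset gate Rt
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev))  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_cand          # Zt = 1 keeps history
```

With all weights at zero, both gates sit at 0.5 and the new state is simply half the previous one, which makes the blending role of Zt easy to see.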
3.3.3 Results and evaluation
[Figure 14] The mean Accuracy-Loss curve of Bi-GRU model in 5
folds
[Figure 14] shows that in the initial stage the loss decays sharply, whereas after a few epochs of training the value tends to plateau, meaning further training can no longer improve the model. The lack of data is still the most probable cause of this result.
[Figure 15] Mean ROC curve of Bi-GRU model in 5 folds.
The ROC plot shows that for the classes that could be evaluated (0 and 2), the discrepancy is not very large, which could mean that the GRU model is not very sensitive to the amount of data. As with the CNN model, there are still three classes with no value (nan).
An RNN mainly focuses on finding the sequential relationship, but it is not good at finding the features of each PRE-box separately.
3.4 Convolutional Neural Network-Bidirectional LSTM Model
Because of the context dependence within the structure of the promoter sequence, and because the CNN model is suited to finding features through mathematical operations whereas the RNN model has a strong ability to find correlations between sequence variables and to find keywords, we tried to combine the advantages of the two models into a new hybrid model: a Convolutional Neural Network (CNN) combined with a Bidirectional Long Short-Term Memory model (Bi-LSTM), the CNN-Bi-LSTM model.
We chose the LSTM model as the additional part of the whole model because, after the CNN processes the data, its output is actually much larger than the original input, and the LSTM model excels at processing long sequences.
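To make the size argument concrete, here is a minimal numpy sketch (kernel widths and counts are hypothetical) of the convolution stage: K kernels sliding over a 4 x N one-hot sequence produce a K-channel feature sequence, and it is this enlarged feature sequence that the Bi-LSTM then consumes:

```python
import numpy as np

def conv1d_features(onehot, kernels):
    """Valid 1-D convolution of a 4 x N one-hot sequence with K kernels of
    width w, giving a K x (N - w + 1) feature map. With many kernels
    (K >> 4), the output carries far more values per position than the
    4-row input, which is why a sequence model follows it."""
    K = len(kernels)
    w = kernels[0].shape[1]
    n_out = onehot.shape[1] - w + 1
    out = np.zeros((K, n_out))
    for k, kern in enumerate(kernels):
        for i in range(n_out):
            out[k, i] = np.sum(kern * onehot[:, i:i + w])
    return out
```

Each column of the output is then treated as one time step of the Bi-LSTM's input sequence.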
3.4.1 Model structure
[Figure 16] The structure of a single LSTM unit
[Figure 16] shows the structure of a single LSTM module. The inputs of the LSTM model are Ct-1, ht-1 and xt; the outputs are Ct and ht. The corresponding relationships are given by the following formulas:
Here xt is the input data, ht is the output value, ht-1 is the previous output value, and f, o, i are three values in [0, 1] that represent, respectively, how much information is retained (forget gate), how much is output (output gate), and how much is updated (input gate): 0 means the corresponding information is completely discarded, withheld, or left un-updated, and 1 means it is fully retained, output, or updated. Through the tanh function we determine which parts need to be updated, and C't represents the candidate value to be written. Ct-1 carries all the information transmitted into the LSTM model at the previous step, so the output information is the result of equation (5), recorded as Ct (the new cell state). From equation (6), the final output value ht is the product of the output probability and Ct after tanh processing.
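The gate relations and the two cell-state updates referred to above as equations (5) and (6) can be sketched in numpy as one LSTM step (weight shapes are hypothetical; W, U, b each hold the four gates' parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four gates in the
    order: forget f, input i, candidate C', output o."""
    W_f, W_i, W_c, W_o = W
    U_f, U_i, U_c, U_o = U
    b_f, b_i, b_c, b_o = b
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)      # forget gate
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)      # input gate
    C_cand = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)   # candidate C'_t
    C_t = f_t * C_prev + i_t * C_cand                  # equation (5): new cell state
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)      # output gate
    h_t = o_t * np.tanh(C_t)                           # equation (6): output
    return h_t, C_t
```

The cell state Ct is the only path along which information survives unchanged across many steps, which is what lets the LSTM handle the long feature sequences coming out of the CNN.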
[Figure 17] Bidirectional LSTM model (the circle refers to a single
LSTM unit)
3.4.2 Results and evaluation
[Figure 18] The Mean Accuracy-Loss curve of CNN-Bi-LSTM model in 5
folds
The Accuracy-Loss plot shows that the hybrid model has the best mean test score among the models we tried, but it is also obvious that the training accuracy remains at a relatively low level and that the loss value barely changed during training. We tried switching the learning rate using learning-rate decay, but the result remained the same. Consequently, we asked our PI for help, and the most probable reason appears to be the lack of data. In the future, after our wet-lab members enrich the promoter-activity data for the PRE-box, we will try the new dataset and further improve the model we built.
[Figure 19] The mean ROC curve of CNN-Bi-LSTM model in
5-folds.
The ROC plot shows that the hybrid model is slightly more precise than the CNN model, as the overall values are a bit higher, but it is apparent that this model needs more data: the value for class 1 is nan, just like the other classes with little data.
3.5 Model comparison
[Figure 20] The total ROC curve of all models.
Note that the data in these figures were selected to represent each model's typical behaviour; due to the nature of neural networks, the results necessarily come with some randomness. In the process of selecting the better model, the Bidirectional GRU showed more instability than the others, while the hybrid CNN-Bidirectional LSTM model stayed the most consistent. We assume this is because the way we process data for the Bi-GRU, extracting PRE-boxes as keywords, shortens the data and thus causes instability. As a matter of fact, the lack of data is the source of some of the biggest obstacles we encountered: being limited to 213 data rows causes consistent overfitting in repeated training. We attempted to use data augmentation, that is, using the data at hand as templates to generate more data. But given the nature of gene sequences, we cannot predict how changing a PRE-box affects intensity; for example, doubling the PRE-box does not mean doubling the intensity.
Another problem likely caused by the lack of data is that changes to the model are not viably reflected in the results: different models produce much the same output. A bigger dataset is required for the project to move further.
Despite all that, the whole modeling process did yield some interesting feedback. First, neural networks have proven to have a clear advantage over traditional machine learning, especially in biology and genetics, because they can provide insights into problems that are still not fully understood by today's science.
A real example that occurred during model development is the Word2Vec step of the Bidirectional GRU. In this step we needed to convert PRE-boxes into a set of meaningful vectors. Because little is understood about the mechanism by which the PRE-box affects promoter intensity, we could not do this manually, so we decided to use Word2Vec, that is, to use a neural network to generate the vectors for us. The critical fact to understand is that we do not know what each digit in these network-generated vectors stands for. Those are the two sides: if we make vectors by hand, we cannot ensure their accuracy; if we generate them with a neural network, we do not know what each digit means. But that is the beauty of a neural network: it can provide insights and guide experiments. Say that after Word2Vec we notice that the third digit in the vectors takes only two values, 0 and 1. We can then assume that the third digit stands for a binary condition, for example whether a certain PRE-box is present at a certain place, and design experiments to test it. Often it is not the experiment itself that is difficult but deciding which experiment to do. And that, we believe, is the significance of modeling in situations like this: to find patterns that are hard for humans to spot directly, and then to provide insight into the matter and guide the next step of exploration.