Problem Restatement
In designing the synthetic biology experiments, we asked two questions: (1) will the addition of the Linker and Dockerin domains to the modified enzyme affect the enzyme's activity? (2) will the scaffoldin we designed impair enzyme activity through steric hindrance? Because little useful structural information is available, we use mathematical modeling to provide some explanations.
The key problem we need to solve is protein structure prediction. When we looked up the literature, we found that the amino acid chain of each protein is twisted, folded and wound into a complex structure, and determining this structure usually takes a long time and is sometimes impossible to complete. So far, although billions of proteins have been sequenced, only about 100,000 protein structures have been solved experimentally. Predicting protein structure from the amino acid sequence alone is therefore a great challenge. Recently, AlphaFold2 was developed using a new neural-network-based model, and it shows a surprising ability to predict protein structures with high accuracy. Fortunately, DeepMind made the source code available on GitHub in July, and our team used both AlphaFold2's ideas and its code for modeling.
Apart from AlphaFold2, we also referred to the prediction models of many other research teams, and designed our own model based on the sequence characteristics of the proteins used by our team.
We used similar methods to investigate whether the addition of the Linker and Dockerin domains to the modified enzyme affects the activity of the enzyme itself. In this case, we need to accurately predict the protein structure both in the absence and in the presence of the Linker and Dockerin.
Choice of NLP Models
First, we developed our own protein structure prediction model, which improves on AlphaFold2. In the sequence feature alignment and classification part of AlphaFold2, we chose Bert, a model from the field of natural language processing.
Considering that AlphaFold2 uses the NLP model Transformer, let us first explain what Transformer does with feature sequence classification. Transformer is an NLP model composed of several Encoders and several Decoders, and each Encoder and Decoder is composed of multiple blocks.
The general workflow of Transformer is as follows [1].
First, we take the representation vector of each amino acid segment in the input amino acid chain, denoted as the vector X. X is obtained by adding two parts: the characteristics of the amino acid segment and the position of the segment in the complete amino acid chain. Stacking the vectors X of all segments gives a matrix, denoted $X_{n, D}$, whose number of rows n is the number of amino acid segments to be analysed and whose number of columns D is the vector dimension.
For the positions of the amino acid segments, Transformer uses a fixed PE (positional encoding) calculation instead of training:
$$PE_{pos,2i}=\sin (\frac{pos}{10000^{\frac{2i}{d}}})\\ PE_{pos,2i+1}=\cos (\frac{pos}{10000^{\frac{2i}{d}}})$$
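As a concrete illustration of the formula above, the sinusoidal positional encoding can be computed with a few lines of Python; this is only a minimal sketch, and the segment count and dimension used here are toy values rather than our actual configuration.

```python
import numpy as np

def positional_encoding(n, d):
    """Sinusoidal positional encoding: rows are positions, columns are dimensions."""
    pe = np.zeros((n, d))
    pos = np.arange(n)[:, None]              # positions 0 .. n-1
    i = np.arange(0, d, 2)                   # even dimension indices (2i in the formula)
    angle = pos / np.power(10000, i / d)     # pos / 10000^(2i/d)
    pe[:, 0::2] = np.sin(angle)              # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)              # PE(pos, 2i+1)
    return pe

# Example: 8 amino acid segments with 16-dimensional embeddings (toy values).
X = np.random.randn(8, 16)                   # segment feature embeddings
X = X + positional_encoding(8, 16)           # add positional information
```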
After that, we use the matrix X as the Encoder input. The input to each Encoder block is either the initial matrix X or the output of the previous Encoder block. After the complete Encoder stack, we obtain the encoding matrix C, whose dimensions are exactly the same as those of the input matrix X.
For a single Encoder block, however, we need to explain the self-attention structure inside it.
We take the matrix X as the input to self-attention, but the values actually used inside the self-attention structure are Q, K, and V, so we need to relate X to Q, K, and V. The method adopted here is a simple linear transformation: $Q = XW^{Q}$, $K = XW^{K}$, $V = XW^{V}$, where $W^{Q}$, $W^{K}$ and $W^{V}$ are learned weight matrices.
After constructing Q, K and V, we can calculate the attention output through the formula:
$$Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_{k}}})V$$
(Note that the softmax in this formula is applied row by row rather than to the matrix as a whole, so extra attention should be paid when implementing it.)
The formula above multiplies the matrix Q by the transpose of K. To keep the resulting values from becoming too large, we divide by the square root of the vector dimension $d_k$. The matrix $QK^{T}$ effectively represents the attention intensity between amino acid segments. Finally, after the softmax, the values in each row of the matrix sum to 1, and the resulting matrix is multiplied by V to obtain the output [2].
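To make the calculation concrete, the following is a minimal NumPy sketch of single-head self-attention under the assumptions above; the weight matrices `Wq`, `Wk` and `Wv` stand in for the learned linear transformations and are filled with random toy values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True) # each row sums to 1

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # linear projections of the input
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # attention intensity between segments
    weights = softmax(scores)                # softmax applied row by row
    return weights @ V                       # weighted combination of the values

# Toy example: 8 segments, model dimension 16, key/value dimension 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
Wq, Wk, Wv = (rng.standard_normal((16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)          # shape (8, 8)
```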
In practice, we usually use multiple self-attention heads. Their outputs are concatenated and passed through a linear transformation, giving an output matrix with the same dimensions as the input matrix.
In addition to the self-attention part, an Encoder block also contains Add, Norm, and Feed Forward layers.
The Add layer uses a residual connection, i.e. $X + \text{MultiHeadAttention}(X)$, so that each sub-layer only has to learn the change relative to its input, which makes a deep stack of blocks easier to train.
The Norm layer refers to Layer Normalization, which normalizes the inputs of the neurons in each layer to the same mean and variance. This operation speeds up convergence very effectively.
The Feed Forward layer is a two-layer fully connected network. The first layer uses a ReLU activation; the second layer has no activation. Expressed as a formula:
$$ \max (0,xW_{1}+b_{1})W_{2}+b_{2} $$
With this, an entire Encoder block has been built; an Encoder is a stack of such blocks.
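The structure just described can be sketched in PyTorch as follows. This is a simplified illustration built from PyTorch's standard layers, with placeholder dimensions; it is not our actual training code, and AlphaFold2's real attention blocks are considerably more elaborate.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Encoder block: multi-head self-attention + Add & Norm + Feed Forward + Add & Norm."""
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(             # two-layer feed forward, ReLU only on the first layer
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q, K, V all come from x
        x = self.norm1(x + attn_out)         # Add (residual connection) & Norm
        x = self.norm2(x + self.ff(x))       # Feed Forward, then Add & Norm again
        return x

# Toy usage: a batch of 1 chain with 8 segments and 256-dimensional features.
block = EncoderBlock()
out = block(torch.randn(1, 8, 256))          # output has the same shape as the input
```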
The Decoder is very similar to the Encoder, with one big difference: the first multi-head attention layer of the Decoder uses a Masked operation, so that the segment currently being processed is not affected by subsequent segments.
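The Masked operation can be illustrated with a small NumPy example: positions after the current segment are set to negative infinity before the softmax, so they receive zero attention weight (toy values, not taken from our model).

```python
import numpy as np

n = 5
scores = np.random.randn(n, n)                     # raw attention scores for 5 segments
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future segments
scores[mask] = -np.inf                             # block attention to later segments
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # each row now ignores future positions
```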
Figure1. Encoder and Decoder Architecture in Transformer
However, we chose to adopt the Bert model instead, for the following reasons [3].
1. Training the Transformer model takes too long, and parameter tuning is very tedious;
2. Bert pretrains deep bidirectional representations from unlabeled text by jointly conditioning on the left and right context in all layers, which is very consistent with the internal connections of an amino acid chain;
3. Through fine-tuning, even a downstream task that provides only a very small amount of data can still use the pre-trained model and obtain a good training effect.
Here, we focus on explaining the fine-tuning mechanism in Bert. The difference from a feature-based classifier is that after the new classifier is trained, the top few layers of the feature-extraction stack are unfrozen and then trained jointly with the classifier. It is called fine-tuning because, starting from the pre-trained parameters, the updates applied during training are relatively small [4]; "relatively" here means compared with the case in which the downstream task's parameters are initialized without any pre-trained model. If a large number of data samples is available for training, all the feature-extraction layers can be unfrozen and all parameters trained; because training starts from pre-trained parameters, this is still much faster than training all parameters from random initialization. Even so, the training process of the Bert model is, to some extent, still long.
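The freeze/unfreeze idea can be sketched generically in PyTorch as follows; `pretrained_encoder`, `d_model`, `num_classes` and `unfreeze_last` are hypothetical placeholders rather than objects from our pipeline, and the sketch assumes the encoder maps each sample to a single `d_model`-dimensional vector.

```python
import torch.nn as nn

def build_finetune_model(pretrained_encoder, d_model, num_classes, unfreeze_last=2):
    """Freeze the pretrained feature extractor except its top blocks, then attach a new classifier."""
    blocks = list(pretrained_encoder.children())
    for block in blocks[:-unfreeze_last]:        # lower layers stay frozen
        for p in block.parameters():
            p.requires_grad = False
    for block in blocks[-unfreeze_last:]:        # "thaw" the top feature-extraction layers
        for p in block.parameters():
            p.requires_grad = True
    head = nn.Linear(d_model, num_classes)       # new task-specific classifier, trained from scratch
    # Assumes the encoder outputs one d_model-dimensional vector per sample.
    return nn.Sequential(pretrained_encoder, head)
```

During the joint training stage, only the unfrozen top blocks and the new classifier receive gradient updates.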
Figure2. Structural Comparison Between Bert and Other NLP models
Bert uses a bidirectional Transformer architecture. OpenAI GPT uses a left-to-right Transformer. ELMo trains left-to-right and right-to-left models independently and then concatenates their outputs to provide sequence features for downstream tasks. Among these three architectures, only Bert's representation takes the left and right context into account at every layer. In comparison, therefore, Bert can take more aspects into account when handling the specific amino acid sequences we use. On this basis, combined with the rest of the structure of AlphaFold2, we began to train the model [5][6].
The hardware configuration we used is as follows:
1. CPU: Intel I9-10900KF, 10 cores, 20 threads.
2. GPU: Tesla V100S PCIe 32GB * 4
3. OS: Ubuntu 20.04 LTS
4. CUDA: 11.3
For data storage, we use about 3 TB of SSD disks. In earlier attempts at data preprocessing we used an Apple M1 chip, which processed the data quickly, probably owing to its large number of cores. However, because of power management constraints and the subsequent need for GPUs, we did not adopt the Apple M1 chip.
Next, we take the cellulase EGL7 used in this project as an example for analysis.
Figure3. The sequence coverage diagram of EGL7.
Figure4. The first round of EGL7 structure prediction.
In the sequence coverage diagram and the lDDT score curve, it is easy to see that the accuracy drops significantly around positions 400 and 460. When we compared these with the sequence alignment diagram, we found that these two places are the sequences immediately preceding the Linker and the Dockerin, respectively.
Despite much effort on the natural language processing model, we found that the prediction accuracy of some fragments was still very low. The phenomenon was observed repeatedly, indicating that it is not accidental, so we needed to improve the segments with low prediction accuracy [7][8].
Data Enhancement
Faced with the low prediction accuracy of some regions, inspection of the data showed that for these regions only a few similar sequences were available in the database during the early sequence alignment, so the accuracy naturally declines. To address this problem, we carried out data enhancement. Our main steps are as follows. First, we extracted the distance matrix and the characteristic relation matrix of the amino acid chain to be optimized, based on a specific receptive field. Then the segment to be optimized is encoded with an Auto Encoder (AE) to facilitate similarity calculation [9]. Next, data enhancement is performed with a Generative Adversarial Network (GAN) [10]. Finally, the weight of the enhanced data is increased and a new epoch is started.
Let us focus on the operations in step 2 and step 3. What is an Auto Encoder? An Auto Encoder is essentially a compression algorithm. Why compress a perfectly good picture? The main reason is that neural networks sometimes need to accept a very large amount of input information; when the input is a high-definition image, for example, it may contain tens of millions of values. Getting a neural network to learn directly from tens of millions of inputs is a demanding task. Therefore, we compress the picture to extract its most representative information and then feed the reduced information to the neural network, which makes learning much easier.
This is where the Auto Encoder comes into play. The original data X is compressed and then decompressed into Y. The reconstruction error is calculated by comparing X and Y, and back-propagation gradually improves the accuracy of the encoding. Note that only the input data X is used from beginning to end; the corresponding label of X is never needed, so auto-encoding is a kind of unsupervised learning. When the Auto Encoder is actually used, usually only its first half (the encoder) is kept. Therefore, when we compare many similar amino acid segments, auto-encoding greatly reduces the computational workload and facilitates the similarity calculation.
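A minimal PyTorch sketch of this compress-and-reconstruct training loop is given below; the layer sizes and the batch of segment features are illustrative assumptions, not the configuration we actually used.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in=400, d_code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_code))
        self.decoder = nn.Sequential(nn.Linear(d_code, 128), nn.ReLU(), nn.Linear(128, d_in))

    def forward(self, x):
        code = self.encoder(x)                    # compressed representation used for similarity comparison
        return self.decoder(code)                 # reconstruction Y, compared against the input X

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 400)                          # a batch of flattened segment features (toy data)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), x)        # reconstruction error between X and Y (no labels needed)
loss.backward()                                   # back-propagate to improve the encoding
opt.step()
```

At comparison time, only `model.encoder` is used to obtain the compact codes on which the similarity calculation runs.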
For the next data enhancement operation, we mainly apply GANs to sequence generation in the natural language processing domain.
GANs were originally designed to generate continuous data, but in natural language processing we use them to generate sequences of discrete tokens. Because the Generator is trained using gradients obtained from the Discriminator, both the Generator and the Discriminator need to be differentiable, which is problematic when discrete variables are present: back-propagation alone cannot provide training gradients for the Generator. In a GAN, we make small changes to the Generator's parameters to make the generated data more "realistic" [11]. If the generated data consists of discrete tokens, the information given by the Discriminator is often meaningless: unlike images, which are continuous, so that small changes are reflected in the pixels, a tiny change applied to a token may correspond to no token at all in the dictionary space. For our sequence generation, however, this approach still works well [12].
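For reference, a minimal GAN training step on continuous feature vectors (the setting GANs were originally designed for) looks like the sketch below; all dimensions and data here are placeholders, and applying the idea to discrete tokens requires the caveats discussed above.

```python
import torch
import torch.nn as nn

d_noise, d_feat = 16, 64
G = nn.Sequential(nn.Linear(d_noise, 128), nn.ReLU(), nn.Linear(128, d_feat))  # Generator
D = nn.Sequential(nn.Linear(d_feat, 128), nn.ReLU(), nn.Linear(128, 1))        # Discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, d_feat)        # stand-in for real (encoded) segment features
noise = torch.randn(32, d_noise)

# Discriminator step: push real samples towards label 1 and generated samples towards 0.
opt_d.zero_grad()
fake = G(noise).detach()              # detach so this step does not update the Generator
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
loss_d.backward()
opt_d.step()

# Generator step: gradients flow from D back into G, nudging its outputs to look more "realistic".
opt_g.zero_grad()
loss_g = bce(D(G(noise)), torch.ones(32, 1))
loss_g.backward()
opt_g.step()
```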
Then we perform an EM operation on some sequence segments, using the EM algorithm to find similarities and classification points between segments. The mathematics of the EM algorithm's iterative process is as follows [13].
First, when there is no latent variable, we can solve directly through the likelihood function:
$$\begin{equation} \theta=\mathop{\arg\max}_{\theta}\sum_{i=1}^{m}\log P(x^{(i)}|\theta) \label{MLE form} \end{equation}$$
When there are hidden variables, the formula becomes:
$$ \begin{equation} {\theta}=\mathop{\arg\max}_{{\theta}}\sum_{i=1}^{m}\log \sum_{z^{(i)}}P(x^{(i)},z^{(i)}|{\theta}) \label{Hidden variables form} \end{equation} $$
Next, we bound this expression.
$$ \begin{align} &\sum_{i=1}^{m}\log \sum_{z^{(i)}}P(x^{(i)},z^{(i)}|{\theta})\\ &=\sum_{i=1}^{m}\log \sum_{z^{(i)}}Q_i(z^{(i)})\frac{P(x^{(i)},z^{(i)}|{\theta})}{Q_i(z^{(i)})} \label{EM_E_1}\\ & \geq \sum_{i=1}^{m}\sum_{z^{(i)}}Q_i(z^{(i)})\log \frac{P(x^{(i)},z^{(i)}|{\theta})}{Q_i(z^{(i)})}
\label{EM_E_2} \end{align} $$
Here $Q_i$ is a new distribution, introduced to facilitate observation and calculation, which satisfies:
$$ \begin{equation} \sum_{{z}}Q_i({z})=1, \quad 0 \leq Q_i({z}) \leq 1 \end{equation} $$
One form of Jensen's inequality, for the concave function $\log$, is:
$$ \begin{equation} \log (E({y})) \geq E(\log ({y})) \end{equation} $$
In our constructed likelihood function, the inner sum can be written as an expectation, so Jensen's inequality applies:
$$ \begin{equation} E(\log \frac{P(x^{(i)},z^{(i)}|{\theta})}{Q_i(z^{(i)})})=\sum_{z^{(i)}}Q_i(z^{(i)})\log \frac{P(x^{(i)},z^{(i)}|{\theta})}{Q_i(z^{(i)})} \label{sum to mean} \end{equation} $$
Equality in Jensen's inequality holds when the ratio inside the logarithm is constant with respect to z, so the new distribution is easy to find:
$$ \begin{align} \begin{split} Q_i(z^{(i)})&=\frac{P(x^{(i)},z^{(i)}|{\theta})}{\sum_{{z}}P(x^{(i)},z^{(i)}|{\theta})}\\ &=\frac{P(x^{(i)},z^{(i)}|{\theta})}{P(x^{(i)}|{\theta})}\\ &=P(z^{(i)}|x^{(i)},{\theta}) \end{split} \end{align} $$
Maximizing this lower bound with respect to ${\theta}$ then gives the iterative update:
$$ \begin{equation} {\theta}:= \mathop{\arg\max}_{{\theta}} \sum_{i=1}^{m} \sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{P(x^{(i)},z^{(i)}|{\theta})}{Q_i(z^{(i)})} \end{equation} $$
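As a worked example of this iteration, the sketch below runs the E-step ($Q_i(z^{(i)}) = P(z^{(i)}|x^{(i)},\theta)$) and the M-step on a one-dimensional two-component Gaussian mixture; it only illustrates the update formulas above and is not the exact model we fit to the sequence segments.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Normal density, broadcasting over mixture components."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm(x, n_iter=50):
    """EM for a 1-D mixture of two Gaussians with parameters theta = (pi, mu, sigma)."""
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: Q_i(z) = P(z | x_i, theta), the posterior responsibility of each component.
        resp = pi * gaussian(x[:, None], mu, sigma)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the lower bound, i.e. re-estimate theta from the responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

# Toy data drawn from two clusters.
x = np.concatenate([np.random.normal(0.0, 1.0, 200), np.random.normal(5.0, 1.0, 200)])
print(em_gmm(x))
```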
After these preparations, we predicted the structures of all the enzymes we used; the results are as follows.
Figure5. The second round of EGL7 structure prediction
From the three-dimensional structures, it is easy to see that the implausible parts of the EGL7 structure are significantly reduced after data enhancement. As can be seen from the quantified lDDT chart, some inaccuracies remain, but the average score is satisfactory.
Figure6. The average lDDT scores of the two models of EGL7.
Figure7. The predicted structure of EG1
Figure8. The predicted structure of XynB
Figure9. The predicted structure of scaffoldin
After obtaining the spatial structures of the individual enzymes we use, we next predicted the structures of the Dockerin-fused enzymes; the structures are shown below. Comparison shows that the introduction of the Dockerin exerted no remarkable effect on the protein structures.
Figure10. EGL7 before and after adding Dockerin
Figure11. EG1 before and after adding Dockerin
Figure12. XynB before and after adding Dockerin
We therefore assume that the enzymatic activity will not be affected significantly by the presence of the Dockerin.
Molecular Docking
After we predicted the Dockerin-fused enzyme structures, we carried out molecular docking between these Dockerin-containing enzymes and the scaffoldin protein using the HDOCK server [14]. Here, the structure of the scaffoldin was predicted with another prediction model. From Figures 13, 14 and 15, we found that the two components can combine with each other smoothly, which provides some theoretical support for the assembly experiment.
Figure13. EGL7 connected to scaffoldin
Figure14. EG1 connected to scaffoldin
Figure15. XynB connected to scaffoldin
References
[1] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016).
[2] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
[3] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[4] Ethayarajh, Kawin. "How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings." arXiv preprint arXiv:1909.00512 (2019).
[5] Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.
[6] Tunyasuvunakool, Kathryn, et al. "Highly accurate protein structure prediction for the human proteome." Nature 596.7873 (2021): 590-596.
[7] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems 28 (2015): 2017-2025.
[8] Yun, Seongjun, et al. "Graph transformer networks." Advances in Neural Information Processing Systems 32 (2019): 11983-11993.
[9] Kusner, Matt J., Brooks Paige, and José Miguel Hernández-Lobato. "Grammar variational autoencoder." International Conference on Machine Learning. PMLR, 2017.
[10] Li, Yuancheng, Yuanyuan Wang, and Shiyan Hu. "Online generative adversary network based measurement recovery in false data injection attacks: A cyber-physical approach." IEEE Transactions on Industrial Informatics 16.3 (2019): 2031-2043.
[11] Mestav, Kursat Rasim, and Lang Tong. "State estimation in smart distribution systems with deep generative adversary networks." 2019 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm). IEEE, 2019.
[12] Tajbakhsh, Nima, et al. "Convolutional neural networks for medical image analysis: Full training or fine tuning?" IEEE Transactions on Medical Imaging 35.5 (2016): 1299-1312.
[13] McLachlan, Geoffrey J., and Thriyambakam Krishnan. The EM Algorithm and Extensions. Vol. 382. John Wiley & Sons, 2007.
[14] Yan Y, Tao H, He J, Huang S-Y. "The HDOCK server for integrated protein-protein docking." Nature Protocols (2020). https://doi.org/10.1038/s41596-020-0312-x.