Team:TJUSLS China/Model

MODEL

Overview

Our modeling work covers three aspects.

  • Modeling the function between the HPLC peak area and A260 to simplify thermostability testing.
  • Using AlphaFold for structure modeling and using the predicted structure for molecular docking with 4 polyethylene terephthalate.
  • Gaining insight into the mechanism behind the mutants’ higher stability through molecular dynamics simulations.

Establishment of the conversion model between UV260 and HPLC Peak Area

Description of the model

The main purpose of our project is to improve the thermal stability of PETase. To evaluate our optimization results, we need to measure the degradation products of PET. The most established method in the field for detecting PET degradation products is high performance liquid chromatography (HPLC). In our experience, HPLC is accurate but time-consuming and expensive.

The amount of PET degradation products formed at different temperatures can also be determined by measuring the ultraviolet absorbance at 260 nm (A260), which in turn reflects the thermostability of PETase. A260 measurement is faster and cheaper, but less accurate.

Based on the measurement principles of the two methods, we expect a positive correlation between the HPLC peak area and A260. However, a purely qualitative relationship is not accurate enough. We therefore want to establish a function linking the two methods, so that HPLC-equivalent results can be obtained quickly and accurately from A260.

We conducted many experiments to obtain the data needed to establish this functional relationship. To process the data, we tried several methods, including least squares, machine learning, and small neural networks. The latter two require a large amount of original data, more than we obtained in the actual experiments. We therefore consider least squares fitting the most suitable method here.

Least squares fitting process:

Suppose X and y satisfy the following relationship:

We define the deviation Ri as the difference between the fitted result and the original data:

We then use Python to minimize the sum of squares of the deviations. Finally, we get the following results:

Figure.1 the function model between HPLC and A260
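
The fitting step can be illustrated with a minimal sketch. It assumes a straight-line relationship between A260 and the HPLC peak area, which is only an assumption made for illustration. The data values and the names fit_line, r_squared and predict_hplc are hypothetical and are not our measured data or our actual script.

```python
import numpy as np

# Hypothetical paired measurements (made-up numbers, for illustration only).
a260 = np.array([0.10, 0.22, 0.35, 0.48, 0.61, 0.75])
hplc_area = np.array([105.0, 228.0, 349.0, 492.0, 603.0, 760.0])


def fit_line(x, y):
    """Return slope a and intercept b that minimise sum((y - (a*x + b))**2)."""
    a, b = np.polyfit(x, y, deg=1)
    return a, b


def r_squared(x, y, a, b):
    """Coefficient of determination of the fitted line."""
    residuals = y - (a * x + b)          # deviations R_i
    ss_res = np.sum(residuals ** 2)      # the quantity least squares minimises
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


slope, intercept = fit_line(a260, hplc_area)
r2 = r_squared(a260, hplc_area, slope, intercept)


def predict_hplc(a260_value):
    """Estimate the HPLC peak area from an A260 reading using the fitted line."""
    return slope * a260_value + intercept
```

With a fit of this kind, a new A260 reading can be converted into an estimated HPLC peak area by calling predict_hplc; how far such an estimate can be trusted depends on the fit quality and on staying inside the calibrated range.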

Summary

  • The UV absorbance of PET degradation products at 260 nm is easy to measure, but it is less accurate.
  • HPLC provides accurate concentration data, but it takes a long time and requires complex operation. Each HPLC run is also expensive.
  • We collected data and established a calibration curve with a positive correlation, so that we can easily obtain accurate product concentrations through our model and quantitatively analyze the thermostability of the mutants and wild-type PETase.
  • Future teams can use the model to determine the concentration easily in their own experiments.

Accurately predict protein structure and do molecular docking

After the mutation, we need to check the substrate specificity of the new protein. We first predict the protein structure with a neural network and then dock the protein with MHET, the small molecule to be decomposed. The docking results of the mutant and wild-type proteins are then compared to assess the specificity of the new protein.

Get protein structure

First, the receptor protein needs to be prepared. In traditional molecular docking, the receptor structure is often obtained by homology modeling. Homology modeling requires a certain sequence similarity between the target protein and the template, and some of its results are unreasonable. We therefore tried newer prediction methods to obtain the protein structure.

AlphaFold and AlphaFold2 are two excellent prediction methods developed in recent years. We compared them to select the one that better fits our modeling needs.

AlphaFold uses a convolutional neural network to learn histograms of geometric information (such as inter-residue distances) for the amino acids, so it predicts from local features. Its data processing is relatively slow, and long-range dependencies in the protein structure may be ignored.

To solve this problem, AlphaFold2 replaces all convolutional neural networks with the attention mechanism, a network architecture that imitates human attention and can focus on multiple details at the same time. In practice, attention works very well: the predictions are highly accurate, and the prediction speed is significantly improved. Because we need accurate model data quickly, we chose AlphaFold2 to build the model.

AlphaFold2 takes the target amino acid sequence, a multiple sequence alignment (MSA) and templates as inputs and produces the 3D structure of the protein through an "end-to-end" neural network. The prediction mainly involves two parts: the Evoformer neural network and the Structure module.
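
To make the idea of attention more concrete, the sketch below shows generic scaled dot-product self-attention in NumPy. It is not AlphaFold2's implementation, only the basic building block in its simplest form. The function names, the random weights and the sizes (6 residues, 8 input features, 4 attention features) are hypothetical choices for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence.

    x:   (n_positions, d_model) input features, one row per position
    w_*: (d_model, d_head) projection matrices for queries, keys and values
    Returns the attended features and the attention weight matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every position is scored against every other position, so distant
    # positions can influence each other in a single step.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights


# Toy example with random numbers (illustration only).
rng = np.random.default_rng(0)
n_res, d_model, d_head = 6, 8, 4
x = rng.normal(size=(n_res, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Because the weight matrix links every position to every other position, long-range relationships are handled in one step, whereas a convolution only sees a local window at a time.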

Evoformer

Evoformer mainly combines graph networks and multiple sequence alignment (MSA) to carry out structure prediction. Graph networks represent relationships between objects well; here, a graph of protein-related information is built to represent the distances between different amino acids. A special "triangle self-attention" mechanism built on attention is used to process and update this relationship graph between amino acids. The MSA places the sites of corresponding residues in the same column, exposing similar regions across different sequences so that structural and functional similarities between different proteins can be inferred. Finally, the two are combined: information is exchanged between the amino acid relationship graph and the MSA, and the pair representation of spatial and evolutionary relationships is derived directly.

Structure module

The main task of the Structure module is to convert the information produced by Evoformer into the 3D structure of the protein. Attention is also used in this part, in a form called "invariant point attention", which can treat each part of the protein separately. It takes an atom as the origin, constructs a 3D reference frame, and rotates and translates it according to the predicted information to obtain a structural framework. The attention mechanism then predicts all atoms and finally assembles a highly accurate protein structure.

Prediction results

The predicted receptor structure is shown in Figure 2: the white structure is the 3D structure predicted by AlphaFold2, and the green structure is the real wild-type. The similarity between them is as high as 92.2%, which shows that our predicted results are reliable.

Figure.2 prediction of protein structure and wild-type protein structure

The preparation of small molecule

Preparing the small molecule ligand MHET is relatively easy. The required ligand can be obtained by drawing it in ChemDraw and converting the file format with Open Babel.

Figure.3 MHET Small molecule

Molecular Docking

Figure.4 docking results of mutant and MHET small molecules
Figure.5 docking results of wild-type and MHET small molecules

The molecular docking step was then carried out. As shown in Figures 4 and 5, there is no significant change in the binding site between the protein and the small molecule before and after the mutation, indicating that our protein still has the specific function that we hope to retain.

Molecular Dynamics Simulation

Since improving the thermal stability of the enzyme is also one of our main goals, we want to verify our results by simulation. We have two options: predicting from the static information of the enzyme, or predicting with molecular dynamics simulation. In our experience, the latter often gives better predictions and also provides more molecular-level information for later analysis. We therefore chose molecular dynamics simulation to predict the stability of the enzyme at high temperature.

Basic principles of molecular dynamics simulation

The Time Evolution Algorithm

The most basic principle of molecular dynamics is Newton's second law of motion, which states that the acceleration a of a particle depends on the mass m of the particle and the resultant force F acting on it.

Acceleration a is the derivative of velocity v with respect to time, that is, the second derivative of position with respect to time. Given the initial velocity and position of the particle, we have:

To solve this problem, an approximate method is needed. Suppose Δt is a very short time interval. Over such a short interval, we can treat the acceleration a of the particle as constant, which simplifies the original equation to a certain extent. The equation obtained in this way contains some error due to the approximation. Other mathematical techniques can reduce this error; for example, choosing a different evaluation time (instead of t0) reduces the error to the order of Δt⁴ (rather than Δt²). This framework is the basis of the leap-frog algorithm, which is used in the vast majority of MD simulation engines:
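
Written out in standard notation, the relations described above take the following form. The symbols x_0, v_0 and t_0 are assumed labels for the initial position, initial velocity and starting time.

```latex
F = m\,a, \qquad a(t) = \frac{dv}{dt} = \frac{d^{2}x}{dt^{2}}

v(t) = v_0 + \int_{t_0}^{t} a(t')\,dt', \qquad
x(t) = x_0 + \int_{t_0}^{t} v(t')\,dt'
```

For a system of many interacting particles the acceleration changes continuously with the positions, so these integrals generally cannot be evaluated in closed form.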

This algorithm can thus “solve” every possible Newton equation at the expense of precision.
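
As an illustration of the leap-frog idea, here is a minimal one-dimensional sketch in Python. It uses the textbook form of the scheme (velocities on half steps, positions on full steps) and is not the code of any particular MD engine. The function name leapfrog and the harmonic-oscillator test case are hypothetical choices for illustration.

```python
import numpy as np


def leapfrog(x0, v0, force, mass, dt, n_steps):
    """Integrate m * x'' = force(x) in one dimension with the leap-frog scheme.

    Velocities live on half steps and positions on full steps:
        v(t + dt/2) = v(t - dt/2) + a(t) * dt
        x(t + dt)   = x(t) + v(t + dt/2) * dt
    Returns the positions at t = 0, dt, 2*dt, ..., n_steps*dt.
    """
    x = float(x0)
    # Bootstrap the first half-step velocity from the initial velocity.
    v_half = v0 + 0.5 * dt * force(x) / mass
    xs = [x]
    for _ in range(n_steps):
        x = x + dt * v_half                       # move positions by a full step
        v_half = v_half + dt * force(x) / mass    # update half-step velocity
        xs.append(x)
    return np.array(xs)


# Test case: harmonic oscillator with k = m = 1, whose exact solution
# for x0 = 1, v0 = 0 is x(t) = cos(t).
dt, n_steps = 0.01, 628
xs = leapfrog(x0=1.0, v0=0.0, force=lambda x: -x, mass=1.0, dt=dt, n_steps=n_steps)
exact = np.cos(dt * np.arange(n_steps + 1))
max_error = np.max(np.abs(xs - exact))
```

For this toy system the numerical trajectory stays close to the exact cosine over a full period, and a smaller time step reduces max_error further at the cost of more steps.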

Force field

We denote by F_ij the force exerted by particle j on particle i. The total force acting on particle i is then the sum of the forces exerted on it by all other particles.

The interaction force F between particles can be obtained by differentiating the potential function with respect to the distance between the particles:

We used a force field parametrized for biomolecules to describe our PETaseSuper5-solution system and PETaseWT-solution system at high temperature. The results and their analysis are shown below.
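
In standard notation, the two statements above can be written as follows. U denotes a pair potential and r_i the position of particle i; both symbols are assumed notation, and the specific form of U depends on the force field used.

```latex
F_i = \sum_{j \neq i} F_{ij}, \qquad
F_{ij} = -\nabla_{r_i} U(r_{ij}), \qquad
r_{ij} = \lvert r_i - r_j \rvert
```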

Figure.6 RMSD of Super5 and wild-type PETase at 333K
Figure.7 RMSF of Super5 and wild-type PETase at 333K
Figure.8 SASA of Super5 and wild-type PETase at 333K

Many previous studies suggest that proteins with lower RMSD values relative to the initial structure, lower RMSF values and lower SASA values during MD simulation tend to be more thermostable. The RMSD values of Super5 are lower than those of the wild-type, indicating that it is more thermostable than wild-type PETase because it remains more native-like (Fig.6). The lower RMSF values of Super5 suggest that certain regions are more stable than in wild-type PETase (Fig.7). The lower SASA values of Super5 reveal that a tight core contributes to its high thermostability (Fig.8).
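
For readers unfamiliar with these metrics, the sketch below shows how RMSD and RMSF are defined, using NumPy on a coordinate array. It assumes that every frame has already been superposed onto the reference structure, and it is not the analysis pipeline we used. The function names and the tiny two-frame trajectory are hypothetical.

```python
import numpy as np


def rmsd_per_frame(traj, ref):
    """RMSD of each frame from a reference structure.

    traj: (n_frames, n_atoms, 3) coordinates, assumed already superposed on ref
    ref:  (n_atoms, 3) reference coordinates (e.g. the initial structure)
    """
    sq_dist = ((traj - ref) ** 2).sum(axis=2)    # squared displacement per atom
    return np.sqrt(sq_dist.mean(axis=1))         # average over atoms


def rmsf_per_atom(traj):
    """RMSF of each atom around its time-averaged position."""
    mean_pos = traj.mean(axis=0)
    sq_dist = ((traj - mean_pos) ** 2).sum(axis=2)
    return np.sqrt(sq_dist.mean(axis=0))         # average over frames


# Tiny made-up trajectory: one atom, two frames, moving 2 units along x.
ref = np.array([[0.0, 0.0, 0.0]])
traj = np.array([[[0.0, 0.0, 0.0]],
                 [[2.0, 0.0, 0.0]]])
rmsd = rmsd_per_frame(traj, ref)   # one value per frame
rmsf = rmsf_per_atom(traj)         # one value per atom
```

RMSD gives one value per frame and measures how far the whole structure drifts from the reference, while RMSF gives one value per atom (or residue) and shows which regions fluctuate most.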