Team:SYSU-Software/Model

Team:SYSU-Software/Model - 2021.igem.org


Structure Prediction

trRosetta is a structure prediction program that builds on the deep-learning approach introduced by AlphaFold. Whereas AlphaFold, as described in [2], uses only the distances between residues for prediction, trRosetta uses both the distances and the orientations between residues[3]. It thus exploits richer structural information and is reported to achieve more accurate predictions[3]. In principle, trRosetta can run the full set of calculations on a desktop computer, but we chose to rent a server for the structure prediction module because of time cost and other considerations.

trRosetta predicts the distance and orientation relationships between amino acid residues from a multiple sequence alignment and converts them into smooth restraints for Rosetta. These restraints are then used in energy-minimization modeling, in which the energy of the folded structure is minimized by gradient descent to obtain the predicted protein structure[3].

- reliability assessment of structural prediction

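We used the TM-score (template modeling score) to assess the reliability of the predictions. The TM-score is a metric of the structural similarity of two proteins. The TM-score between two protein structures is defined by:

$$\text{TM-score} = \max\left[\frac{1}{L_{\text{target}}}\sum_{i=1}^{L_{\text{common}}}\frac{1}{1+\left(\frac{d_i}{d_0(L_{\text{target}})}\right)^{2}}\right]$$

where $L_{\text{target}}$ is the length of the amino acid sequence of the target protein and $L_{\text{common}}$ is the number of residues that appear in both the template and target structures. $d_i$ is the distance between the $i$th pair of residues in the template and target structures, and $d_0(L_{\text{target}}) = 1.24\sqrt[3]{L_{\text{target}}-15}-1.8$ is a distance scale that normalizes distances. The maximum is taken over the possible superpositions of the two structures. When comparing two protein structures that have the same residue order, the residue pairing is read from the C-alpha residue numbering of the structure files.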

The TM-score lies between 0 and 1, with 1 meaning that the two structures agree completely. A score below 0.17 corresponds to unrelated structures chosen at random, whereas a score greater than 0.5 indicates that the two structures generally share the same fold.
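The sum inside the TM-score can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the standard distance scale d0 = 1.24 (L_target - 15)^(1/3) - 1.8, it scores a single, already superposed set of residue pairs (it does not search for the optimal superposition, which real TM-score programs do), and the function names `d0` and `tm_score_fixed_superposition` are hypothetical.

```python
def d0(l_target: int) -> float:
    """Distance scale that normalizes distances by target length."""
    scale = 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8
    if scale <= 0:
        # The formula is only meaningful for sufficiently long targets.
        raise ValueError("target too short for the d0 formula")
    return scale


def tm_score_fixed_superposition(distances, l_target: int) -> float:
    """TM-score term for ONE given superposition (no maximization).

    distances: distance (in angstroms) for each residue pair present in
               both structures; its length plays the role of L_common.
    l_target:  number of residues in the target protein.
    """
    scale = d0(l_target)
    total = sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances)
    return total / l_target
```

With this form, a pair at distance 0 contributes 1 to the sum and a pair at distance d0 contributes 0.5, so missing residues and large deviations both lower the score.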

We collected nine proteins deposited in the PDB after May 1, 2018, predicted their structures, and scored the accuracy of each prediction. The raw TM-scores of the nine proteins are shown in the table, and the TM-scores are plotted against amino acid sequence length in the scatter plot.

TM-score of the predicted protein

Result of the TM-score of the predicted protein

The Pearson’s product-moment correlation of the TM-score result

Based on the data, the TM-scores of all predictions are greater than 0.5, which indicates that the predicted structures are reliable. In the scatter plot, the TM-score gradually decreases as the protein sequence length increases, showing a negative correlation between length and prediction accuracy. After dividing the sequences into three length groups (50-150, 150-250, and more than 250 residues) and averaging the TM-score within each group, the histogram shows this trend more clearly. trRosetta thus achieves better prediction accuracy for shorter sequences and slightly lower, but still reliable, accuracy for longer sequences (TM-score > 0.6). These results agree with the expectation that the number of possible ways a peptide chain can fold grows exponentially with the length of the amino acid sequence, which makes the interactions between residues harder to predict.
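The two summaries used here, the Pearson correlation between length and TM-score and the per-group mean TM-score, could be computed along the following lines. This is a sketch rather than the script we ran: the function names are hypothetical, and the bin boundaries are assumed to be half-open ([50, 150), [150, 250), 250 and above), because the text does not say which group a length of exactly 150 or 250 belongs to.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def mean_score_by_length(lengths, scores):
    """Average score within each sequence-length group.

    Assumed (hypothetical) boundaries: [50, 150), [150, 250), >= 250.
    Groups with no members are omitted from the result.
    """
    groups = {"50-150": [], "150-250": [], ">250": []}
    for length, score in zip(lengths, scores):
        if 50 <= length < 150:
            groups["50-150"].append(score)
        elif 150 <= length < 250:
            groups["150-250"].append(score)
        elif length >= 250:
            groups[">250"].append(score)
    return {name: sum(vals) / len(vals) for name, vals in groups.items() if vals}
```

A negative value of `pearson_r(lengths, tm_scores)` corresponds to the downward trend seen in the scatter plot.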

Histogram of the TM-score of the predicted protein

CAD-score

We used the CAD-score[5] as an index of the structural similarity of the active site before and after protein fusion. The CAD-score compares, between two structures, the contact areas between each residue and its surrounding residues.

The calculation is done as follows: let G denote the set of all residue pairs (i, j) with nonzero contact area T(i, j) in the target structure. Then, for each residue pair (i, j) ∈ G, we compute the contact area M(i, j) in the model. If the model contains residues that are not present in the target, these residues are excluded from the calculation of the contact areas. If some residues are present in the target but missing in the model, all contact areas of those residues in the model are set to zero.

Then, for each residue pair (i, j) ∈ G, we define the contact area difference CAD(i, j) as the absolute difference between the contact area of residues i and j in the target T and in the model M:

$$\text{CAD}(i,j) = \left|T(i,j) - M(i,j)\right|$$

To treat overprediction and underprediction of contact areas symmetrically, we use the bounded CAD(i, j), defined as follows, instead of the original CAD(i, j) values:

$$\text{CAD}_{\text{bounded}}(i,j) = \min\left(\text{CAD}(i,j),\ T(i,j)\right)$$

The CAD-score for the entire model is defined as:

$$\text{CAD-score} = 1 - \frac{\sum_{(i,j)\in G}\text{CAD}_{\text{bounded}}(i,j)}{\sum_{(i,j)\in G}T(i,j)}$$

From the above formula, the CAD-score lies between 0 and 1: the closer it is to 1, the more similar the two structures are, and the closer it is to 0, the more they differ.
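The procedure described above can be sketched as follows, assuming the bounded difference is min(|T(i, j) - M(i, j)|, T(i, j)) and the score is one minus the ratio of the summed bounded differences to the summed target contact areas, as in the CAD-score paper[5]. The contact areas are assumed to have been computed beforehand and stored in dictionaries keyed by residue pair; the function name `cad_score` and this data layout are hypothetical.

```python
def cad_score(target_contacts, model_contacts):
    """CAD-score from precomputed residue-residue contact areas.

    target_contacts: dict mapping (i, j) -> contact area T(i, j) in the target.
    model_contacts:  dict mapping (i, j) -> contact area M(i, j) in the model.
    Pairs missing from the model are treated as having zero contact area;
    pairs that appear only in the model are ignored, because the sums run
    over the target pairs only.
    """
    # G: residue pairs with nonzero contact area in the target.
    g = {pair: area for pair, area in target_contacts.items() if area > 0}
    total_target_area = sum(g.values())
    if total_target_area == 0:
        raise ValueError("target has no nonzero contact areas")

    bounded_sum = 0.0
    for pair, t_area in g.items():
        m_area = model_contacts.get(pair, 0.0)
        # Bounding by T(i, j) caps the penalty for overpredicted contacts.
        bounded_sum += min(abs(t_area - m_area), t_area)

    return 1.0 - bounded_sum / total_target_area
```

A model with identical contact areas scores 1, and a model with none of the target contacts scores 0.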

Three-step Catalyzed Reaction System

This is a model of a three-step catalyzed reaction system that contains free enzymes and enzyme-enzyme complexes. Building on the work of Poshyvailo et al.[6] and Kuzmak et al.[7], we established a model to analyze the system. Given numerical parameter values as input, our program returns an approximate function of the total reaction velocity v in terms of the enzyme gathering proportion k.

parameter table

- hypothesis

(1) The three kinds of enzyme can cluster to form the complexes E12, E23, and E13, because the opto-switch protein oligomerizes when exposed to light. Because dimers are the most common form, we consider only these three complexes;

(2) Intermediates I1, I2 transfer from enzyme to enzyme by diffusion;

(3) We assume that any single complex occupies a computational box (complexes are evenly distributed in the whole system, and the distribution is sparse enough), and has a relatively independent environment;

(4) For simplicity, we model the enzymes as spherical particles;

(5) The concentration of substrate in the first step reaction remains unchanged, i.e., [S] is a constant;

(6) Because the reactions catalyzed by E1 and by E3 are relatively independent, we treat E13 as free enzymes E1 and E3;

(7) The total concentration of each enzyme in the system is the same.

- model

With hypothesis (1), we get all the reaction equations in the system as follows:

The first step

The second step

The third step

Because the concentration of an intermediate is relatively higher in the neighborhood of the enzyme that produces it, we need a model of the distribution of the intermediate to evaluate the acceleration brought by the proximity that opto-switch protein oligomerization induces.

Schematic diagram of the intermediate reaction; the orange color density reflects the concentration of the intermediate

With assumptions (2) and (4), we use the model of Kuzmak et al.[7] (Supplementary Information S4) to obtain the concentration of I1 at a point in the computational box as a function of the distance between that point and the active site of E1 in the complex E12.

where ,

Similarly, we obtain the concentration of I2 at a point in the computational box as a function of the distance between that point and the active site of E2 in the complex E23 and of the concentration of I1.

where ,

The following are the evolution equations for the concentrations of all species in the system:

equations about free enzyme

equations about complexes

In addition, by mass conservation and under assumptions (6) and (7), the total amount of each enzyme satisfies the following equations (k denotes the enzyme gathering proportion):

In the steady state (setting the rate of change of every concentration on the left-hand side of the evolution equations to 0), we obtain a system of algebraic equations that in theory has an explicit solution. However, the system was too complex for us to solve directly, so we simplified the model. Briefly, we separate the three-step reaction system into two parts, decomposing the equations into one section for I1 and one for I2. Within each section, we handle the free enzymes and the enzymes in complexes separately and then combine the results according to their proportions in the system. By doing so, and using the expression of total reaction velocity

our program returns v as a function of the enzyme gathering proportion k for the numerical parameters given as input.

With the model of the three-step catalyzed reaction system, we designed an opto-controllable element system that controls a cascade reaction through light intensity. Suppose the target reaction is catalyzed by enzymes 1, 2, and 3 (E1, E2, and E3, whose kinetics are known in advance) and CRY2 is chosen as the opto-switch to create an opto-controllable element for each enzyme (ele1, ele2, and ele3). Six kinds of dimers can then form, but only the ele1-ele2 and ele2-ele3 dimers can accelerate the reaction. By determining the active site of each enzyme, we can obtain the relation between light intensity and reaction rate.
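The count of six dimers follows from pairing three elements with repetition allowed (three homodimers and three heterodimers). A small sketch of the enumeration, in which the names `ELEMENTS`, `ACCELERATING`, and both functions are only illustrative:

```python
from itertools import combinations_with_replacement

ELEMENTS = ["ele1", "ele2", "ele3"]

# Dimers that bring consecutive steps of the cascade together.
ACCELERATING = {("ele1", "ele2"), ("ele2", "ele3")}


def enumerate_dimers(elements):
    """All unordered pairs of elements, including homodimers."""
    return list(combinations_with_replacement(elements, 2))


def accelerating_dimers(elements):
    """Subset of dimers that pair enzymes of consecutive reaction steps."""
    return [d for d in enumerate_dimers(elements) if d in ACCELERATING]
```

Here `enumerate_dimers(ELEMENTS)` lists the three homodimers plus ele1-ele2, ele1-ele3, and ele2-ele3, and `accelerating_dimers(ELEMENTS)` keeps only ele1-ele2 and ele2-ele3.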

The relation between light intensity (L) and reaction rate (V) obtained with the model of the three-step catalyzed reaction system. The initial reaction rate increases sharply with slight changes in light intensity. Once the light intensity reaches a certain level, the reaction rate remains within a certain range. The fitted function is:
$$V = 7.36708272\times10^{-20}L^{4} - 3.56160460\times10^{-16}L^{3} + 3.58828244\times10^{11}L^{2} - 1.08007577\times10^{6}L + 1.31100798$$

SLinker: Linker Database

A linker is a sequence that connects two domains. Its length and amino acid composition determine its function. Longer linkers are more hydrophilic and more exposed to the aqueous solvent than short linkers[8]. According to their secondary structure, linkers can be divided into helical and non-helical linkers[8]. Helical linkers adopt an α-helix structure and act as rigid spacers that separate domains, which decreases inter-domain interference. Non-helical linkers are rich in Pro and have no fixed rigid structure.

- linker online database

To date, there are three public online linker databases: LINKER[9], IBIVU LinkerDataBase[8], and SynLinker[10]. However, except for the IBIVU LinkerDataBase, these databases are inaccessible. The IBIVU LinkerDataBase was established in 2002 and has not been kept up to date; its analysis is based on the protein structures available at that time. Since then, protein structure classification resources such as CATH and SCOP and three-dimensional structure prediction tools such as Rosetta and AlphaFold have been developed further and provide more accurate and reasonable analysis of protein secondary, super-secondary, and tertiary structure. There is therefore currently no online database of natural linkers that is based on the latest protein structures and updated in real time.

- SLinker database

Based on data from the National Center for Biotechnology Information (NCBI) and the Conserved Domain Database (CDD), we obtained linkers by subtracting the conserved domain sequences from the full-length sequence of each protein. The sorted data were placed in the SLinker database, so that users can select the desired linker according to requirements such as length range, solvent accessibility, and terminal amino acid type.

We used the CD-Search tool in CDD to identify the conserved domains of 4799 multi-domain protein sequences obtained from NCBI. We then preprocessed the search results to filter out unqualified conserved domains and to find the maximum set of non-overlapping conserved domains in each protein sequence. If a sequence has only one conserved domain, the full-length sequence is treated as a single conserved domain and contains no linker. If it has more than one, the conserved domain sequences are subtracted from the full-length sequence, the first and last remaining segments are discarded, and the segments that remain are the linker sequences.
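The extraction step can be illustrated with a short sketch. It assumes that each domain hit is given as a 1-based, inclusive (start, end) interval on the protein sequence, and it uses a greedy earliest-end selection to pick a maximum set of non-overlapping domains. That selection rule is an assumption made for illustration, not necessarily the rule used in our preprocessing, and both function names are hypothetical.

```python
def select_non_overlapping(domains):
    """Pick a maximum-size set of non-overlapping (start, end) intervals.

    Greedy rule: sort by end position and keep every interval that starts
    after the last kept interval ends.
    """
    chosen = []
    last_end = 0
    for start, end in sorted(domains, key=lambda d: (d[1], d[0])):
        if start > last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


def extract_linkers(sequence, domains):
    """Return the segments lying between consecutive conserved domains.

    The N-terminal and C-terminal flanks are not returned, which matches
    discarding the first and last remaining segments.
    """
    chosen = select_non_overlapping(domains)
    if len(chosen) < 2:
        return []  # a single domain leaves no inter-domain linker
    linkers = []
    for (_, prev_end), (next_start, _) in zip(chosen, chosen[1:]):
        # Residues prev_end+1 .. next_start-1 in 1-based coordinates.
        segment = sequence[prev_end:next_start - 1]
        if segment:
            linkers.append(segment)
    return linkers
```

Adjacent domains with no residues between them yield no linker, because empty segments are skipped.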

We collected 7124 linkers, of which 3940 have a length of 26 to 100 amino acid residues. Owing to limited computing power, we used trRosetta[3] to predict the three-dimensional structures of 500 linkers and FreeSASA[11] to calculate their solvent accessibility.

To make the SLinker database more complete, we added empirical linkers that have been reported in the literature[12-17] and classified them as rigid or flexible linkers[18].

The result is the SLinker database, in which each linker record contains the primary sequence, sequence length, three-dimensional structure, solvent accessibility, and related conserved domain information. Users can specify a length range, solvent accessibility, terminal amino acid type, and other requirements to retrieve and display the desired linker sequences.
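A query of this kind might look like the sketch below. The record layout (`LinkerRecord` with `sequence` and `solvent_accessibility` fields) and the `query_linkers` function are hypothetical and do not reflect the actual SLinker schema or interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class LinkerRecord:
    sequence: str                  # primary sequence, one-letter codes
    solvent_accessibility: float   # e.g. a value computed with FreeSASA


def query_linkers(
    records: Iterable[LinkerRecord],
    min_len: int = 0,
    max_len: Optional[int] = None,
    min_accessibility: float = 0.0,
    n_terminal: Optional[str] = None,
    c_terminal: Optional[str] = None,
) -> List[LinkerRecord]:
    """Keep the records that satisfy every criterion that was supplied."""
    hits = []
    for rec in records:
        length = len(rec.sequence)
        if length < min_len:
            continue
        if max_len is not None and length > max_len:
            continue
        if rec.solvent_accessibility < min_accessibility:
            continue
        if n_terminal is not None and not rec.sequence.startswith(n_terminal):
            continue
        if c_terminal is not None and not rec.sequence.endswith(c_terminal):
            continue
        hits.append(rec)
    return hits
```

Criteria left at their defaults are not applied, so calling `query_linkers(records)` with no other arguments returns every record.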

- data analysis of natural linker

We analyzed the 3940 linkers of 26-100 amino acids in the SLinker database. The lengths of this group of linkers are mainly concentrated between 40 and 60 residues.

The distribution of linker amino acid types is consistent with the results of Argos and of George and Heringa. Pro residues may increase the rigidity and structural independence of the linker, so most multi-domain proteins contain inter-domain linkers with Pro-rich sequences. Small polar amino acids (such as Thr, Ser, and Gly) may provide good flexibility because of their small size, and their ability to form hydrogen bonds with water in aqueous solution can help maintain the stability of the linker structure.

red bars represent amino acids that account for more than 5%, and blue bars represent amino acids that account for less than 5%

The linker amino acid distribution difference (diff) is obtained by the following formula:

where the two terms refer to the amino acid distribution of the linkers and to the amino acid distribution of proteins in general[19]. The difference is particularly pronounced for proline and tyrosine, indicating that Pro is more likely to appear in linkers, whereas Tyr tends to appear in domains.

blue bars represent amino acids with ratios greater than 0, red bars represent amino acids with ratios less than 0

References

(1) Mao, W. Z., Ding, W. Z., Xing, Y. G. & Gong, H. P. AmoebaContact and GDFold as a pipeline for rapid de novo protein structure prediction. Nat Mach Intell 2, 25-33 (2020).

(2) Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706-710, doi:10.1038/s41586-019-1923-7 (2020).

(3) Yang, J. Y. et al. Improved protein structure prediction using predicted interresidue orientations. P Natl Acad Sci USA 117, 1496-1503, doi:10.1073/pnas.1914677117 (2020).

(4) Ribeiro, A. J. M. et al. Mechanism and Catalytic Site Atlas (M-CSA): a database of enzyme reaction mechanisms and active sites. Nucleic Acids Res 46, D618-D623, doi:10.1093/nar/gkx1012 (2018).

(5) Olechnovic, K., Kulberkyte, E. & Venclovas, C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 81, 149-162, doi:10.1002/prot.24172 (2013).

(6) Poshyvailo, L., von Lieres, E. & Kondrat, S. Does metabolite channeling accelerate enzyme-catalyzed cascade reactions? PLoS One 12, e0172673, doi:10.1371/journal.pone.0172673 (2017).

(7) Kuzmak, A., Carmali, S., von Lieres, E., Russell, A. J. & Kondrat, S. Can enzyme proximity accelerate cascade reactions? Sci Rep 9, 455, doi:10.1038/s41598-018-37034-3 (2019).

(8) George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng 15, 871-879, doi:10.1093/protein/15.11.871 (2002).

(9) Crasto, C. J. & Feng, J. A. LINKER: a program to generate linker sequences for fusion proteins. Protein Eng 13, 309-312, doi:10.1093/protein/13.5.309 (2000).

(10) Liu, C., Chin, J. X. & Lee, D. Y. SynLinker: an integrated system for designing linkers and synthetic fusion proteins. Bioinformatics 31, 3700-3702, doi:10.1093/bioinformatics/btv447 (2015).

(11) Mitternacht, S. FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Res 5, 189, doi:10.12688/f1000research.7931.1 (2016).

(12) Huston, J. S. et al. Protein engineering of antibody binding sites: recovery of specific activity in an anti-digoxin single-chain Fv analogue produced in Escherichia coli. Proc Natl Acad Sci U S A 85, 5879-5883, doi:10.1073/pnas.85.16.5879 (1988).

(13) Bai, Y. & Shen, W. C. Improving the oral efficacy of recombinant granulocyte colony-stimulating factor and transferrin fusion protein by spacer optimization. Pharm Res 23, 2116-2121, doi:10.1007/s11095-006-9059-5 (2006).

(14) Sabourin, M., Tuzon, C. T., Fisher, T. S. & Zakian, V. A. A flexible protein linker improves the function of epitope-tagged proteins in Saccharomyces cerevisiae. Yeast 24, 39-45, doi:10.1002/yea.1431 (2007).

(15) Lu, P. & Feng, M. G. Bifunctional enhancement of a beta-glucanase-xylanase fusion enzyme by optimization of peptide linkers. Appl Microbiol Biotechnol 79, 579-587, doi:10.1007/s00253-008-1468-4 (2008).

(16) Bergeron, L. M., Gomez, L., Whitehead, T. A. & Clark, D. S. Self-Renaturing Enzymes: Design of an Enzyme-Chaperone Chimera as a New Approach to Enzyme Stabilization. Biotechnol Bioeng 102, 1316-1322, doi:10.1002/bit.22254 (2009).

(17) de Bold, M. K. et al. Characterization of a long-acting recombinant human serum albumin-atrial natriuretic factor (ANF) expressed in Pichia pastoris. Regul Pept 175, 7-10, doi:10.1016/j.regpep.2012.01.005 (2012).

(18) Chen, X., Zaro, J. L. & Shen, W. C. Fusion protein linkers: property, design and functionality. Adv Drug Deliv Rev 65, 1357-1369, doi:10.1016/j.addr.2012.09.039 (2013).

(19) Doolittle, R. F. in Prediction of Protein Structure and the Principles of Protein Conformation (ed. Fasman, G. D.) (1989).