FUSION PROTEIN MODELING
Motivation
We decided to create probiotics that synthesize naringenin - a compound that belongs to flavonoid class and shows antiprotozoal activity against Entamoeba histolytica. The synthesis of naringenin takes place in four reaction steps that are catalyzed by four enzymes: TAL (tyrosine ammonia-lyase from Herpetosiphon aurantiacus ), 4CL (4-coumarate-CoA ligase 2 from Glycine max ), CHS (chalcone synthase from Arabidopsis thaliana ), and CHI (chalcone isomerase from Medicago sativa ) [34]. Although, a high concentration of the compound is needed to have the desired effect on our target amoeba species. Therefore, we came up with an idea to increase the flavonoid synthesis by boosting the bottleneck reactions catalyzed by proteins 4CL and CHS by linking the two mentioned proteins together. Additionally, we were interested in checking whether the shorter distance between the active sites of the fused proteins provides higher efficiency of the reaction measured by the higher concentration of produced naringenin. For this purpose, we needed to determine what kind of structure this fusion protein complex would have. From this information we could calculate the distance between active sites of the enzymes.
Connection to synthetic biology
Synthetic biology allows scientists to reach beyond naturally evolved architectures by combining biological, chemical [44], and computer engineering. Since one of the goals of synthetic biology is to redesign systems that are found in nature to adjust them for desired purposes,this year our team decided to boost the production of naringenin in our system. To achieve this, we combined biological and chemical engineering attributes. We chose to include in silico modeling techniques to support one of the main concepts of synthetic biology - connecting different scientific fields - and broaden our understanding of the designed system. Specialists from structural bioinformatics and the de novo protein design field contributed to this part of our project by providing valuable expertise to improve our fusion protein modeling and make this element more substantiated.
Modeling approach
Once we came up with an idea to fuse proteins 4CL and CHS, we looked for existing modeling approaches applied to the cases like ours. The initial approach was to analyze the structure of the fusion protein complex based on the structural analysis of a candidate against malaria [1]. S. Shamriz and H. Ofoghi presented the basal approach that consists of primary structural analysis, secondary structure prediction, three-dimensional model prediction, structure validation with molecular dynamics, and the analysis of the simulations.
Choice of the fusion components
4CL (N')
We took 4CL from Glycine max. This protein does not have an experimentally determined structure. Hence, the template (PDB: 5BSR) (homology model) for the initial homology-based modeling was taken from Nicotiana tabacum, which is not an organism of our choice (more information about 4CL modeling can be found in 'Our fusion proteins' section). Sequence similarity between Glycine max and Nicotiana tabacum 4CL homologs is 67.66%, which is considered high enough to perform homology-based modeling for this protein.
CHS (C')
We took CHS protein from Arabidopsis thaliana. This protein has got its structure determined using the X-ray diffraction method, and it is deposited in Protein Data Bank with accession number 6DXB.
Linkers
Based on the literature analysis [4, 38, 39], we found out that beneficial interaction between two enzymes takes place after incorporating linkers that are composed of 5 to 15 amino acid residues. We included GSG linker because 4CL from Arabidopsis thaliana - a homolog to the protein of our choice - has already been linked with stilbene synthase (STS). There were in vitro kinetic assays performed for the 4CL:GSG:STS complex and its crystallographic structure (PDB: 3TSY) was deposited to Protein Data Bank. In summary, we chose to connect our proteins using linkers that are flexible: GSG, GGGGS, (GGGGS)2, (GGGGS)3; and rigid: EAAAK, (EAAAK)2, (EAAAK)3. They were analysed in terms of their hydrophobicity [31], sensitivity to proteases [32], and secondary structure [33].
Linker | Length | Flexibility | Hydrophobicity | Sensitivity to proteases | Secondary structure |
---|---|---|---|---|---|
GSG | 3 | Flexible | Neutral | - | Flexible |
GGGGS | 5 | Flexible | Neutral | - | Flexible |
(GGGGS)2 | 10 | Flexible | Neutral | - | Flexible |
(GGGGS)3 | 15 | Flexible | Neutral | - | Flexible |
EAAAK | 5 | Rigid | 60% hydrophobic | Glutamyl endopeptidase | α-helical |
20% acidic | LysC | ||||
20% basic | LysN | ||||
Proteinase K | |||||
Staphylococcal peptidase I | |||||
Thermolysin | |||||
Trypsin | |||||
(EAAAK)2 | 10 | Rigid | 60% hydrophobic | Asp-N endopeptidase + N-terminal Glu | α-helical |
20% acidic | Glutamyl endopeptidase | ||||
20% basic | LysC | ||||
LysN | |||||
Proteinase K | |||||
Staphylococcal peptidase I | |||||
Thermolysin | |||||
Trypsin | |||||
(EAAAK)3 | 15 | Rigid | 60% hydrophobic | Asp-N endopeptidase + N-terminal Glu | α-helical |
20% acidic | Glutamyl endopeptidase | ||||
20% basic | LysC | ||||
LysN | |||||
Proteinase K | |||||
Staphylococcal peptidase I | |||||
Thermolysin | |||||
Trypsin |
Primary structural analysis
There were proteins found that were formerly linked: 4CL and STS (stilbene synthase from Vitis vinifera ) with GSG linker. This structure was determined experimentally using the X-ray diffraction method and deposited into the Protein Data Bank database with PDB: 3TSY. The experimentally determined structure of linked proteins 4CL:GSG:STS does not have a part (residues 483-588) of the fusion protein sequence, which takes up the C terminal part of 4CL amino acid chain and GSG linker. Since GSG is flexible, it is not surprising that the structure of the GSG region is disordered. However, it is interesting that the whole region of the C terminal in 4CL protein becomes disordered. Considering the rigid linker case, there are examples of experimentally determined structures deposited in Protein Data Bank [28, 29], that prove that the structure of fusion proteins linked via rigid linkers can be experimentally determined without having a partial loss of the modeled sequence.
Ab initio protein modeling
Our motivation to choose ab initio protein modeling instead of homology-based is that homology-based programs might not cover the whole fusion protein sequence that we intend to model. This assumption was confirmed when we provided the trivial case of linking - 4CL linked with CHS via GSG. SWISS-MODEL [3] program provided one template that covered the whole target sequence - the analyzed 3TSY structure. The modeled system included a massive disordered region on the C terminal part of the 4CL sequence. The rest of the 48 templates covered roughly half of the sequence - on the N terminal or the C terminal. It was not surprising that there were no templates that would have a higher than 70 percent identity on the N terminal side since 4CL from G. max does not have an experimentally determined structure. Meanwhile, as expected, the SWISS-MODEL program did find a 100 percent match on the C terminal for CHS from A. thaliana. Since SWISS-MODEL builds each model based on one template, this approach was not suitable for our case.
Model | Template | Sequence identity | QMEAN | QMEANDisCo | VoroMQA |
---|---|---|---|---|---|
model_1.pdb | 3TSY | 66.42% | -2.19 | 0.71 | 0.504224 |
model_2.pdb | 3TSY | 69.36% | -2.41 | 0.71 | 0.496802 |
Therefore, to model the complete structure of our fusion protein sequence, we tried using ab initio modeling methods (I-TASSER [7], RoseTTAFold [2], and trRosetta [8]) instead of homology-based modeling approaches. Eventually, the initial modeling was performed only using trRosetta and RoseTTAFold because I-TASSER did not complete the queries we uploaded to their server in May and June. Additionally, the standalone version could not be accessed at the time.
Lastly, we attempted to use both versions of trRosetta program: server and standalone. We found 3DFI [9] Perl scripts created to run the program and generate structure files in PDB format when the input sequence in FASTA is provided. We tested the program and managed to run it for a test target - 4CL protein. However, the standalone version of trRosetta did not finish the modeling of our fusion protein. The reason for this might be that the complex is simply too big, and the computer that we got access to run the program on had got little RAM (15 GB in total with 7 GB swap) with only 8 CPUs. There has also been an issue closed on Github [47] about the similar problem that trRosetta takes a long time on large MSA’s. It was explained that even running the program on a machine with GPU might throw an ‘out of memory’ error for MSA with ~600 residues. Meanwhile, our system is composed of roughly 1000 residues, and the computer did not contain any GPU; therefore, it is clear why we did not manage to run this program in our case.
RoseTTAFold modeling approach can be accessed via the Robetta server when the RoseTTAFold option is chosen. The first release of RoseTTAFold became open on Github as a standalone version of the program on 4th July; however, it requires over 400 GB of disc space. Since we did not have such resources, we could not use the standalone version of the program.
Before getting access to AlphaFold2, we initially attempted to model structures using RoseTTAFold without providing manually created multiple sequence alignment (MSA) files. In our project development, we had consultations with researchers that specialize in protein structural bioinformatics, and we were suggested to create our own MSA files. The mentioned approach should represent our fusion protein system not partially but as a whole more accurately (a more detailed description of this modeling approach can be found in the further sections). Since RoseTTAFold does not provide the MSA files that it generates by default, only an assumption can be made that these files might not accurately represent the region of the sequence that is located near the linker. Therefore, we wrote scripts to generate MSA files with template sequences for our fusion proteins connected with our chosen linker.
Scoring modeled complexes
The model of the complex had to be chosen according to its evaluation scores. We had a few consultations about the protein complex modeling and each of the consulting specialists of this field emphasized that we should apply at least several scoring systems, hence the best quality model would be chosen.
VoroMQA
VoroMQA [10] - the method is based on partitioning space into Voronoi cells. Two adjacent Voronoi cells share a set of points that form a surface called a Voronoi face, which can be viewed as a geometric representation of contact between two atoms. VoroMQA determines the quality of protein structural models using inter-atomic and solvent contact areas, including the knowledge-based statistical potential.
ProSA
ProSA [11] is a program that uses knowledge-based potentials of mean force to evaluate the accuracy of the input model. The structure’s energy is evaluated using a distance-based pair potential and a potential that captures the solvent exposure of protein residues. From these energies, two characteristics are derived: a z-score and a plot of its residue energies.
The z-score indicates overall model quality and measures the structure’s total energy deviation concerning an energy distribution derived from random conformations. Z-scores outside a range characteristic determined by structures of native proteins indicate errors and low quality of the model that is being evaluated.
TM-Score
TM-Score [12] is a template modeling score is a variation of the Levitt-Gerstein (LG) score, which weights shorter distances between corresponding residues more strongly than longer distances.
TM-align
TM-align [46] is an algorithm that employs TM-Score and dynamic programming to generate the best structural alignment between two protein structures.
CAD-Score
The definition of CAD-score [13] is based on the three considerations:
- contacts in the model should be evaluated according to the contacts in the reference structure (target);
- any missing residues in the model should be treated in the same way as if none of their contacts were correctly predicted;
- strong over-prediction (nonphysical overlap) of a particular contact should be equivalent to missing that contact entirely (in other words, there should be no false-positive predictions in terms of contact).
QMEAN
QMEAN [48] is an acronym for Qualitative Model Energy ANalysis. There are five different descriptors used for this evaluation approach, which are:
- torsion angle potential;
- distance-dependent pairwise residue potential;
- solvation potential;
- the term describing agreement of predicted and calculated secondary structure;
- the term describing solvent accessibility.
QMEANDisCo
In addition to QMEAN composite scoring, QMEANDisCo [14] introduces a new distance constraint (DisCo) that evaluates the agreement of pairwise distances and constraints taken from homologous experimentally determined protein structures.
ProQ3D
ProQ3D [15] is a single-model quality estimation program. ProQ estimated the quality of the whole model using a machine learning approach - support-vector machine - by summing up the predicted qualities for each residue. ProQ2 included profile weights to improve the predictions. ProQ3 included energy terms calculated from Rosetta.
Overall, ProQ3D uses the same inputs as in ProQ3. Although, a neural network is used to assess the quality of the protein structure instead of a support vector machine.
Ramachandran plot analysis
Protein structural scientists use Ramachandran plots to gain a better insight into the secondary structure of peptides. It represents the combinations of φ and ᴪ dihedral angles. Most pairs of φ and ᴪ are sterically forbidden since they introduce interatomic clashes. Outlines mark allowed regions in the graph. The plots were generated using RamachanDraw library. We thought it would be helpful to add the customization functionality; therefore, we wrote a small software tool RamachanDrawX that allows us to plot Ramachandran plots more efficiently and customize them for the desired design.
Structure refinement using molecular dynamics
After modeling and evaluating structures with several scoring functions, there is usually a need to refine structures after modeling in silico to avoid potential clashes between atoms and make the design more stable energetically. However, our protein complexes are rather big; they are composed of 962 - 974 residues ( 173 610 - 329 648 atoms without solvate particles). After consulting with a molecular dynamics specialist working with protein structures, we found out that at least 100 nanoseconds of simulations would be needed for our fusion protein system, which is a computationally costly procedure.
We chose GROMACS program to perform molecular dynamics simulations since it is freely available, versatile, and has an extensive documentation. Our initial choice of molecular dynamics protocol was based on a tutorial published in Bioinformatics Review [23]. As mentioned, we organized a consultation with a bioinformatics scientist who specializes in molecular dynamics. After the meeting, we had all the answers to the questions raised related to this part of modeling and received many suggestions on how to get the most out of molecular dynamics simulation. Generally, after the consultation, we adjusted our molecular dynamics protocol to be the one described in the table below. Explanations for each of the used commands can be found in the Perl script written for this molecular dynamics flow.
Before consultation | After consultation | Command |
---|---|---|
Preparing protein file by removing other chains and heterogenous atoms | `grep` | |
Converting protein's PDB file to Gromacs format GRO | `pdb2gmx` | |
Choosing force-field CHARMM27 | Choosing force-field AMBER94 | |
Choosing water model SPC | Choosing water model TIP3 | |
Defining a box with 2 Å distance between protein each wall of the dodecahedron type box | Defining a box with 1.2 Å distance between protein and each wall of the dodecahedron type box | `editconf` |
Solvating protein | `solvate` | |
Adding ions to neutralize the system, selecting group SOL | Adding ions to neutralize the system | `grompp`, `genion` |
Adding ions to simulate water with dissolved salt (concentration 150 mM) | ||
Energy minimization step to reach negative potential energy and maximum force to be less than 1000 kJ/mol/nm | `grompp`, `mdrun`, `energy` | |
Temperature equilibration for 100 ps, 300 K | `grompp`,`mdrun` `energy` | |
Pressure equilibration for 100 ps, 300 K | `grompp`,`mdrun` `energy` | |
Production run for 1 ns | Production run for 100 ns | grompp`, `mdrun` |
MARTINI coarse-grained simulations
Since we want to calculate the distance between active sites of our linked proteins, the cases with rigid linkers are relatively trivial. Although, we had to dig deeper about the approach to calculate distance between active sites in a flexible linker case. The initial idea was to find the maximum distance between active centers in the flexible linker case. This approach requires all possible conformations to find the one that provides the maximum distance between proteins. On the other hand, there might be particular conformations that are more likely than others. In any case, molecular dynamics simulations were needed to perform to catch the shifts in protein conformations.
However, since molecular dynamics simulations need to be performed on the large protein system consisting of roughly 1000 residues, the atomic molecular dynamics might last around a week on an 80 core system. Therefore bioinformatician V. Kairys suggested using MARTINI scripts and performing coarse-grained simulations. This kind of approach is less computationally expensive. The third version of MARTINI scripts is released, although we opted for the second version since it has more tutorials. Additionally, since Python3 is more commonly used nowadays, we chose martinize.py script adjusted for this Python version.
Modeling fusion system components
4CL
Since we take 4CL protein from Glycine max organism, we had to model the structure for our protein. Initially, we attempted to do this by using two programs: SWISS-MODEL [3] and RoseTTAFold. When AlphaFold2 [35] became freely available in July, we decided to model our chosen 4CL protein using this prominent tool. For this purpose, we used Google Colab notebook [36], in which AlphaFold2 was implemented, and we received three models in unrelaxed and relaxed states. According to the documentation found in DeepMind’s AlphaFold2 repository, the relaxed state contains a structure made after Amber relaxation procedure made on an unrelaxed structure - model provided right after modeling with AlphaFold2. It was not stated whether the average plDDT score provided as output is calculated for the unrelaxed or relaxed state. However, the models representing relaxed states were generally evaluated better.
Model trRosetta | QMEAN4 | QMEANDisCo | VoroMQA score |
---|---|---|---|
4CL.sqit.pdb | -0.94 | 0.67 | 0.478739 |
Model SWISS-MODEL | QMEAN4 | QMEANDisCo | VoroMQA score |
---|---|---|---|
4CL_1.pdb | -0.73 | 0.84 | 0.541558 |
4CL_2.pdb | -0.61 | 0.82 | 0.519713 |
4CL_3.pdb | -0.34 | 0.85 | 0.549002 |
Model RoseTTAFold | QMEAN4 | QMEANDisCo | VoroMQA score |
---|---|---|---|
MODEL_1_4CL2_SOYBN.pdb | 1.71 | 0.77 | 0.540512 |
MODEL_5_4CL2_SOYBN.pdb | 1.44 | 0.77 | 0.540305 |
MODEL_2_4CL2_SOYBN.pdb | 1.6 | 0.76 | 0.539742 |
MODEL_3_4CL2_SOYBN.pdb | 1.61 | 0.77 | 0.533458 |
MODEL_4_4CL2_SOYBN.pdb | 1.79 | 0.76 | 0.530254 |
Model | QMEAN4 | QMEANDisCo | VoroMQA score |
---|---|---|---|
MODEL_1_4CL_MSA.pdb | 1.88 | 0.77 | 0.541581 |
MODEL_2_4CL_MSA.pdb | 1.12 | 0.75 | 0.526281 |
MODEL_3_4CL_MSA.pdb | 1.09 | 0.76 | 0.529501 |
MODEL_4_4CL_MSA.pdb | 0.95 | 0.76 | 0.534835 |
MODEL_5_4CL_MSA.pdb | 1.41 | 0.76 | 0.529382 |
Model | QMEAN4 | QMEANDisCo | VoroMQA score | Average plDDT |
---|---|---|---|---|
4CL_unrelaxed_model_1.pdb | -0.62 | 0.79 | 0.546974 | 89.651979 |
4CL_unrelaxed_model_2.pdb | -1.01 | 0.79 | 0.549646 | 89.276523 |
4CL_unrelaxed_model_3.pdb | -1.02 | 0.79 | 0.546767 | 88.780811 |
4CL_relaxed_model_1.pdb | 0.13 | 0.8 | 0.551421 | - |
4CL_relaxed_model_2.pdb | 0.3 | 0.8 | 0.552571 | - |
4CL_relaxed_model_3.pdb | -0.86 | 0.8 | 0.549808 | - |
The model generated with AlphaFold2, when overlaid with homology-based modeling created structure, showed a structural mismatch at the C terminal from 453rd residue until the last one (562nd) in AlphaFold2 model; meanwhile, the design of SWISS-MODEL did not have the complete sequence modeled.
CHS
As mentioned in the description of the initial modeling approach, chalcone synthase (CHS) from A. thaliana has got its structure determined experimentally. However, after reading the CASP14 analysis about refinement category [25, 26], which analyzed the accuracy of AlphaFold2 models. The study pointed out that AlphaFold2 results might indicate crystal lattice contacts - the locations in the experimentally determined structure that are not as accurate as the rest of the protein. Therefore, it was intriguing to compare CHS’ experimental structure with its AlphaFold2 model and check whether the crystal lattice contacts of the experimental structure can be observed in our case.
Model | QMEAN4 | QMEANDisCo | VoroMQA score | Average plDDT |
---|---|---|---|---|
CHS_unrelaxed_model_1.pdb | 0.08 | 0.85 | 0.535824 | 96.86796 |
CHS_unrelaxed_model_2.pdb | -0.13 | 0.85 | 0.542367 | 96.62286 |
CHS_unrelaxed_model_3.pdb | -0.45 | 0.84 | 0.536282 | 96.34417 |
CHS_relaxed_model_1.pdb | 0.2 | 0.86 | 0.542643 | - |
CHS_relaxed_model_2.pdb | 0.21 | 0.86 | 0.539895 | - |
CHS_relaxed_model_3.pdb | 0.2 | 0.86 | 0.540226 | - |
For modeling to be optimal, AlphaFold2 should have at least 30 sequences coverage at each position. Meanwhile, plDDT measures how well the environment in the reference structure is represented in the modeled protein structure. Apart from the first positions in the N terminal, predicted lDDT was highly evaluated in each position of CHS protein.
The experimental structure did not have the design of the first six residues. This might be because the N terminal part of the sequence was considered to be a disordered region, just as it was with the C terminal of 4CL protein in the case of PDB: 3TSY. According to the PyMOL [50] B-factor [41] coloring of PDB: 6DXB structure, the less accurate residues were found in positions: 7-8, 67, 83-84, 208-209, 271-276, 295-299, 327-328 (Table 8). Meanwhile, the B-factor coloring showed that only the first eight in the N terminal and residues in positions 210-212 were inaccurate. Since the crystal packing can distort local protein structure, it can be assumed that there were crystal lattice contacts in the places of inaccurately represented residues. These residues were observed in the experimental structure and not in the model generated with AlphaFold2.
Experimental structure | Modeled structure |
---|---|
7-8 | 1-8 |
67 | - |
83-84 | - |
208-209 | 210-212 |
271-276 | - |
295-299 | - |
327-328 | - |
Visually observed differences between experimental structure and the best model from AlphaFold2 were seen in residues: 235-237 (partial alpha helix was not obtained in the model), 261-263, 269-273 (beta-sheets were introduced in the model), 383-385 (the beta-sheet was slightly bent in the model).
Initial fusion system modeling
We attempted to model fusion protein systems with seven different linkers: GSG, EAAAK, (EAAAK)2, (EAAAK)3, GGGGS, (GGGGS)2, and (GGGGS)3. Each structure was modeled using two ab initio protein modeling tools: trRosetta and RoseTTAFold. The structures were evaluated using VoroMQA scoring function. ProSA z-scores are not provided for all structures since the web-server was temporarily unavailable when we employed this modeling approach. The results are represented in the tables below. All structures were analyzed visually with PyMOL visualization program. The modeling results were not satisfactory due to the large portion of 4CL being uncoiled in the fusion protein model. The model for each linker case was chosen according to VoroMQA score.
We attempted to use Yasara [45] program to perform energy minimization. The output of Yasara did introduce a slight structural shift, which was observed with PyMOL command `cealign` when the minimized model was compared with the original 4CL:GSG:CHS model. However, the VoroMQA score of the original complex was higher (0.516429) than of the one that was generated by Yasara program (0.512734). Since the structural change was not significant, we did not apply this program to other structures.
Besides Yasara, we also attempted to do structural refinement using molecular dynamics simulations. Since molecular dynamics is a computationally expensive procedure, we tried performing a trivial version of simulations on our target system. In order to test this technique, we used steps described in the section about the molecular dynamics approach before consulting the specialist of the field.
The results of the modeling and refinement flow for 4CL:GSG:CHS case are presented in the supplementary data file.
Modeling strategy with MSA
After consulting specialists in the protein modeling field and receiving fresh ideas about improving our modeling workflow, we were suggested including multiple sequence alignment (MSA) as an additional input into our chosen modeling programs. Nonetheless, both trRosetta and RoseTTAFold generate MSA internally, yet these MSA files do not represent the linked protein case accurately enough because the output models still had a large portion of the disordered region between the linked proteins. Therefore, we were encouraged to use sequence-search methods to create MSA files for our modeling input ourselves.
Bioinformaticians from Stockholm University presented PconsDock approach [6], showing that the fold-and-dock method can model a pair of proteins linked via poly-G between two chains. This technique resembled our problem, thus we attempted to include into the constructed MSA files not only the exact linker sequence yet also the prolonged with glycines on both sides version of it. We wrote scripts to generate multiple sequence alignment files adjusted to our system of fused proteins. The script that produces MSA files was written in Perl. This script takes as input two a3m files, a linker sequence, and a boolean option whether to extend the linker sequence or not. The variation to extend linker sequence with poly-G was included in the script’s functionality, although this modeling method could not be applied practically; the extended linker system was not intended to be checked in the wet-lab experimentally. Therefore we continued this modeling flow only with the exact linker sequence included in the MSA. Full query-template a3m MSA files were obtained from HHblits [40] server and taken as an input to our script. The script reads the MSA file of the first protein and saves taxonomic IDs of the species to which the sequences belong. Afterward, the script reads the MSA file of the second protein, checks whether the taxonomic ID of the sequence has a match in the first MSA file, and if the match is found, the sequences are merged via the linker sequence. The final multiple sequence alignment file is printed on the screen of the terminal.
Results of modeling approach with MSA
The results of the modeling with MSA files included can be found in supplementary data file.
Modeling with AlphaFold2
The highly accurate protein structure prediction tool AlphaFold2 evaluated in CASP14 was published in Nature to be freely open to the public on 15th July 2021. Since the program requires extensive computational resources, we could not install it and run it locally. However, Kliment Olechnovič, the bioinformatician from Vilnius University, suggested that we model our protein complexes using a Google Colab notebook which has AlphaFold2 adjusted to protein complexes. According to the CASP14 evaluation of model refinement category [25, 26], AlphaFold2 generated structures require no structure refinement in most cases. During the modeling software assessment, the errors observed in the modeled structures were mainly found at crystal lattice contacts, where the experimentally determined structure is likely to be unrepresentative of biological conformations. Therefore molecular dynamics simulations for AlphaFold2 modeled structure refinement are unnecessary.
The modeling with AlphaFold2 was performed using its generated MSA files with MMseq2. The composed MSA files covered each part of the fusion protein system by at least 500 sequences, which is a good result because the best modeling results are achieved when there is coverage of 30 (ideally 100) sequences at each position (based on the notes in Google Colab notebook ).
AlphaFold2 has five parameter sets for modeling, thus in most cases, it can produce five distinct modeling variants. Although our fusion protein system is composed of around 1000 residues that make up a rather large complex, the Colab version of AlphaFold2 gave us two unrelaxed models as an output for each linker case. In general, the quality of the models was impressive, according to plDDT scoring; the score was mostly high except for the linker region.
Predicted aligned error (PAE) gives information about the relative orientations of different parts of the modeled structure [49]. The PAE plot shows that each modeled part of the fusion protein system - 4CL, the linker, and CHS - are modeled reliably within each portion, yet their positions relative to each other are not evaluated confidently. Therefore, only an assumption can be made that rigid linkers take up the modeled orientation. However, rigid linker domains and terminals of the linked proteins might have disordered parts that significantly change the conformation. Although, this assumption had to be made because we had limited access to computational resources for molecular dynamics simulations to receive a better view of the system. The main purpose of this modeling workflow is to find distances between active sites of the fused proteins. Therefore we required molecular dynamics simulations to determine the most likely conformations of protein systems, and due to computational limitations, we restricted this idea to the systems with flexible linkers.
Ramachandran plots show whether there are no steric clashes in the structures. Apparently, the given plots show that the clashes are present in the modeled structures. The reason for this is that the models are not relaxed (Amber force-field was not applied).
Linker GSG
Model | QMEAN | QMEANDisCo | VoroMQA |
---|---|---|---|
4CL_GSG_CHS_unrelaxed_model_1.pdb | -1.01 | 0.78 | 0.539915 |
4CL_GSG_CHS_unrelaxed_model_2.pdb | -1.18 | 0.77 | 0.537669 |
Model | ProQ2D | ProQRosCenD | ProQRosFAD | ProQ3D |
---|---|---|---|---|
4CL_GSG_CHS_unrelaxed_model_1.pdb | 0.709 | 0.625 | 0.705 | |
4CL_GSG_CHS_unrelaxed_model_2.pdb | 0.701 | 0.633 | 0.801 | 0.721 |
Linker EAAAK
Model | QMEAN | QMEANDisCo | VoroMQA |
---|---|---|---|
4CL_EAAAK_CHS_unrelaxed_model_1.pdb | -0.69 | 0.79 | 0.542077 |
4CL_EAAAK_CHS_unrelaxed_model_2.pdb | -1.03 | 0.79 | 0.536476 |
Model | ProQ2D | ProQRosCenD | ProQRosFAD | ProQ3D |
---|---|---|---|---|
4CL_EAAAK_CHS_unrelaxed_model_1.pdb | 0.708 | 0.624 | 0.794 | 0.710 |
4CL_EAAAK_CHS_unrelaxed_model_2.pdb | 0.7 | 0.63 | 0.798 | 0.717 |
Linker (EAAAK)2
Model | QMEAN | QMEANDisCo | VoroMQA |
---|---|---|---|
4CL_EAAAK_2_CHS_unrelaxed_model_1.pdb | -0.78 | 0.79 | 0.541834 |
4CL_EAAAK_2_CHS_unrelaxed_model_2.pdb | -1.16 | 0.78 | 0.536158 |
Model | ProQ2D | ProQRosCenD | ProQRosFAD | ProQ3D |
---|---|---|---|---|
4CL_EAAAK_2_CHS_unrelaxed_model_1.pdb | 0.69 | 0.638 | 0.81 | 0.718 |
4CL_EAAAK_2_CHS_unrelaxed_model_2.pdb | 0.693 | 0.616 | 0.789 | 0.691 |
Linker (EAAAK)3
Model | QMEAN | QMEANDisCo | VoroMQA |
---|---|---|---|
4CL_EAAAK_3_CHS_unrelaxed_model_1.pdb | -1.36 | 0.78 | 0.537969 |
4CL_EAAAK_3_CHS_unrelaxed_model_2.pdb | -1.38 | 0.77 | 0.533997 |
Model | ProQ2D | ProQRosCenD | ProQRosFAD | ProQ3D |
---|---|---|---|---|
4CL_EAAAK_3_CHS_unrelaxed_model_1.pdb | 0.693 | 0.617 | 0.795 | 0.697 |
4CL_EAAAK_3_CHS_unrelaxed_model_2.pdb | 0.693 | 0.634 | 0.803 | 0.713 |
Linker GGGGS
Model | QMEAN | QMEANDisCo | VoroMQA |
---|---|---|---|
4CL_GGGGS_CHS_unrelaxed_model_1.pdb | -0.75 | 0.78 | 0.542 |
4CL_GGGGS_CHS_unrelaxed_model_2.pdb | -1.16 | 0.77 | 0.536974 |
Model | ProQ2D | ProQRosCenD | ProQRosFAD | ProQ3D |
---|---|---|---|---|
4CL_GGGGS_CHS_unrelaxed_model_1.pdb | 0.691 | 0.617 | 0.779 | 0.698 |
4CL_GGGGS_CHS_unrelaxed_model_2.pdb | 0.691 | 0.640 | 0.809 | 0.725 |
Linker (GGGGS)2
Model | QMEAN | QMEANDisCo | VoroMQA |
---|---|---|---|
4CL_GGGGS_2_CHS_unrelaxed_model_1.pdb | -1.02 | 0.78 | 0.540667 |
4CL_GGGGS_2_CHS_unrelaxed_model_2.pdb | -0.9 | 0.78 | 0.53278 |
Model | ProQ2D | ProQRosCenD | ProQRosFAD | ProQ3D |
---|---|---|---|---|
4CL_GGGGS_2_CHS_unrelaxed_model_1.pdb | 0.697 | 0.635 | 0.807 | 0.716 |
4CL_GGGGS_2_CHS_unrelaxed_model_2.pdb | 0.697 | 0.629 | 0.802 | 0.725 |
Linker (GGGGS)3
Model | QMEAN | QMEANDisCo | VoroMQA |
---|---|---|---|
4CL_GGGGS_3_CHS_unrelaxed_model_1.pdb | -1.25 | 0.78 | 0.540568 |
4CL_GGGGS_3_CHS_unrelaxed_model_2.pdb | -1.36 | 0.77 | 0.531981 |
Model | ProQ2D | ProQRosCenD | ProQRosFAD | ProQ3D |
---|---|---|---|---|
4CL_GGGGS_3_CHS_unrelaxed_model_1.pdb | 0.696 | 0.627 | 0.801 | 0.707 |
4CL_GGGGS_3_CHS_unrelaxed_model_2.pdb | 0.687 | 0.626 | 0.804 | 0.719 |
Conclusion
It is possible that molecular dynamics simulations might bring a more accurate view about distances between protein active sites, since MD simulations take into account movements of flexible parts in the fusion complex. However, the complexes are large structures, thus they would require powerful computational resources.
We compared the distances that we captured in our modeling flow and Kcat/Km values of 4CL and STS complex analyzing the respective flexible linker cases from literature [39]. In the case of 4CL linked with STS, Kcat/Km values decrease for the linkers ordered as GSG, GGGGS, (GGGGS)2, (GGGGS)3. If we continued with the hypothesis that was set in the beginning, the tendency of increasing distances for linkers ordered in the mentioned way cannot be observed in Table 16. This comparison does not bring a reliable insight, since the modeled complex does not match the one in the literature. What is more, the rigid linker cases are not included.
Unfortunately, naringenin was not produced in our wet lab, therefore we decided not to run molecular dynamics simulations. Additionally, the initially planned correlation analysis of distances between protein active sites and the production of naringenin measure could not be made.
Linker | Distance between active sites (Å) |
---|---|
GSG | 40.1 |
GGGGS | 38.8 |
(GGGGS)2 | 39.4 |
(GGGGS)3 | 39.2 |
EAAAK | 42.7 |
(EAAAK)2 | 42.7 |
(EAAAK)3 | 47.6 |
Alternative modeling approaches
Reverse linking GSG
After doing a visual analysis of the initially modeled complexes, we saw that the C terminal of 4CL structure in a fusion protein is unfurled in all cases of linkers; therefore, we started to look for ways to improve our modeling approach. One of the ideas was to perform reverse linking and check how the target domain behaves when the proteins are linked in the reverse order. Although the results showed an even more disrupted structure of linked proteins, therefore we rejected this approach. It can be seen that the linker is found in the modeled domain of CHS structure. Additionally, both structures CHS and 4CL are not structurally aligned to their native structures.
The results of the reverse linking for 4CL:GSG:CHS case are presented in the supplementary data file.
Separate modeling
The idea of the separate structural modeling was to model fusion proteins with a chosen rigid linker separately and align the distinct structures afterward. The naive expectation was to get modeled sequences aligned via a portion of the linker sequence. We decided to test this technique with the trivial case of rigid linker - EAAAK. However, due to the short length of sequence that was intended to be aligned, PyMOL did not perform `cealign` command successfully.
The results of the separate modeling for 4CL:GSG:CHS case are presented in the supplementary data file.
Modeling linkers as unknown amino acids
28th July S. Ovchinnikov shared that AlphaFold2 is capable of modeling structures with unknown amino acids (labeled as ‘U’). Therefore we decided to give this feature a try and model structures with poly-U chains as linkers with N equal to 3, 5, 10, and 15. Although it was noted that the relaxation procedure is unavailable when unknown amino acids are in the structure.
The results of modeling with unknown amino acids in the place of linkers show that linkers of 3 and 5 amino acids provide protein orientation that is similar to the one modeled with known amino acids. Meanwhile, linkers of 10 and 15 unknown amino acids doom the higher degree of freedom to the orientation of the complex.
Scores of the models with unknown aminoacids can be found in supplementary data
Modeling strategy with docking
Another approach that was suggested is to try applying protein docking. For this approach, steps would be to use protein-protein docking software, generate a significant number of conformations, filter conformations based on the distance between linked ends according to the length of the linker that the structure is modeled for, pick the filtered conformations, evaluate quality with previously used scoring functions. This approach was not applied due to the shortage of time.