Team:Vilnius-Lithuania/Fusion Protein Modeling

FUSION PROTEIN MODELING

Header

Motivation

We decided to create probiotics that synthesize naringenin - a compound that belongs to flavonoid class and shows antiprotozoal activity against Entamoeba histolytica. The synthesis of naringenin takes place in four reaction steps that are catalyzed by four enzymes: TAL (tyrosine ammonia-lyase from Herpetosiphon aurantiacus ), 4CL (4-coumarate-CoA ligase 2 from Glycine max ), CHS (chalcone synthase from Arabidopsis thaliana ), and CHI (chalcone isomerase from Medicago sativa ) [34]. Although, a high concentration of the compound is needed to have the desired effect on our target amoeba species. Therefore, we came up with an idea to increase the flavonoid synthesis by boosting the bottleneck reactions catalyzed by proteins 4CL and CHS by linking the two mentioned proteins together. Additionally, we were interested in checking whether the shorter distance between the active sites of the fused proteins provides higher efficiency of the reaction measured by the higher concentration of produced naringenin. For this purpose, we needed to determine what kind of structure this fusion protein complex would have. From this information we could calculate the distance between active sites of the enzymes.

Connection to synthetic biology

Synthetic biology allows scientists to reach beyond naturally evolved architectures by combining biological, chemical [44], and computer engineering. Since one of the goals of synthetic biology is to redesign systems that are found in nature to adjust them for desired purposes,this year our team decided to boost the production of naringenin in our system. To achieve this, we combined biological and chemical engineering attributes. We chose to include in silico modeling techniques to support one of the main concepts of synthetic biology - connecting different scientific fields - and broaden our understanding of the designed system. Specialists from structural bioinformatics and the de novo protein design field contributed to this part of our project by providing valuable expertise to improve our fusion protein modeling and make this element more substantiated.

Modeling approach

Once we came up with an idea to fuse proteins 4CL and CHS, we looked for existing modeling approaches applied to the cases like ours. The initial approach was to analyze the structure of the fusion protein complex based on the structural analysis of a candidate against malaria [1]. S. Shamriz and H. Ofoghi presented the basal approach that consists of primary structural analysis, secondary structure prediction, three-dimensional model prediction, structure validation with molecular dynamics, and the analysis of the simulations.

Choice of the fusion components

4CL (N')

We took 4CL from Glycine max. This protein does not have an experimentally determined structure. Hence, the template (PDB: 5BSR) (homology model) for the initial homology-based modeling was taken from Nicotiana tabacum, which is not an organism of our choice (more information about 4CL modeling can be found in 'Our fusion proteins' section). Sequence similarity between Glycine max and Nicotiana tabacum 4CL homologs is 67.66%, which is considered high enough to perform homology-based modeling for this protein.

Fig. 1. 4CL2 from Glycine max (pink) with 4CL2 from Nicotiana tabacum (electric)

CHS (C')

We took CHS protein from Arabidopsis thaliana. This protein has got its structure determined using the X-ray diffraction method, and it is deposited in Protein Data Bank with accession number 6DXB.

Fig. 2. CHS catalyzed reaction
Fig. 3. Active site residues of CHS

Linkers

Based on the literature analysis [4, 38, 39], we found out that beneficial interaction between two enzymes takes place after incorporating linkers that are composed of 5 to 15 amino acid residues. We included GSG linker because 4CL from Arabidopsis thaliana - a homolog to the protein of our choice - has already been linked with stilbene synthase (STS). There were in vitro kinetic assays performed for the 4CL:GSG:STS complex and its crystallographic structure (PDB: 3TSY) was deposited to Protein Data Bank. In summary, we chose to connect our proteins using linkers that are flexible: GSG, GGGGS, (GGGGS)2, (GGGGS)3; and rigid: EAAAK, (EAAAK)2, (EAAAK)3. They were analysed in terms of their hydrophobicity [31], sensitivity to proteases [32], and secondary structure [33].

Table 1. Characterization of linkers
Linker Length Flexibility Hydrophobicity Sensitivity to proteases Secondary structure
GSG 3 Flexible Neutral - Flexible
GGGGS 5 Flexible Neutral - Flexible
(GGGGS)2 10 Flexible Neutral - Flexible
(GGGGS)3 15 Flexible Neutral - Flexible
EAAAK 5 Rigid 60% hydrophobic Glutamyl endopeptidase α-helical
20% acidic LysC
20% basic LysN
Proteinase K
Staphylococcal peptidase I
Thermolysin
Trypsin
(EAAAK)2 10 Rigid 60% hydrophobic Asp-N endopeptidase + N-terminal Glu α-helical
20% acidic Glutamyl endopeptidase
20% basic LysC
LysN
Proteinase K
Staphylococcal peptidase I
Thermolysin
Trypsin
(EAAAK)3 15 Rigid 60% hydrophobic Asp-N endopeptidase + N-terminal Glu α-helical
20% acidic Glutamyl endopeptidase
20% basic LysC
LysN
Proteinase K
Staphylococcal peptidase I
Thermolysin
Trypsin

Primary structural analysis

There were proteins found that were formerly linked: 4CL and STS (stilbene synthase from Vitis vinifera ) with GSG linker. This structure was determined experimentally using the X-ray diffraction method and deposited into the Protein Data Bank database with PDB: 3TSY. The experimentally determined structure of linked proteins 4CL:GSG:STS does not have a part (residues 483-588) of the fusion protein sequence, which takes up the C terminal part of 4CL amino acid chain and GSG linker. Since GSG is flexible, it is not surprising that the structure of the GSG region is disordered. However, it is interesting that the whole region of the C terminal in 4CL protein becomes disordered. Considering the rigid linker case, there are examples of experimentally determined structures deposited in Protein Data Bank [28, 29], that prove that the structure of fusion proteins linked via rigid linkers can be experimentally determined without having a partial loss of the modeled sequence.

Fig. 4. Modeled 4CL (pink) overlay with monomeric enzymes 4CL and STS (PDB 3TSY, electric), disordered domain (light pink)

Ab initio protein modeling

Our motivation to choose ab initio protein modeling instead of homology-based is that homology-based programs might not cover the whole fusion protein sequence that we intend to model. This assumption was confirmed when we provided the trivial case of linking - 4CL linked with CHS via GSG. SWISS-MODEL [3] program provided one template that covered the whole target sequence - the analyzed 3TSY structure. The modeled system included a massive disordered region on the C terminal part of the 4CL sequence. The rest of the 48 templates covered roughly half of the sequence - on the N terminal or the C terminal. It was not surprising that there were no templates that would have a higher than 70 percent identity on the N terminal side since 4CL from G. max does not have an experimentally determined structure. Meanwhile, as expected, the SWISS-MODEL program did find a 100 percent match on the C terminal for CHS from A. thaliana. Since SWISS-MODEL builds each model based on one template, this approach was not suitable for our case.

Fig. 5. Modeled 4CL:GSG:CHS (fusion - electric, linker - cyan) overlay with template structures of enzymes 4CL (pink) with disordered domain (light pink) and CHS (dark green)
Table 2. Evaluating 4CL:GSG:CHS structure that was modeled using homology-based method SWISS-MODEL
Model Template Sequence identity QMEAN QMEANDisCo VoroMQA
model_1.pdb 3TSY 66.42% -2.19 0.71 0.504224
model_2.pdb 3TSY 69.36% -2.41 0.71 0.496802

Therefore, to model the complete structure of our fusion protein sequence, we tried using ab initio modeling methods (I-TASSER [7], RoseTTAFold [2], and trRosetta [8]) instead of homology-based modeling approaches. Eventually, the initial modeling was performed only using trRosetta and RoseTTAFold because I-TASSER did not complete the queries we uploaded to their server in May and June. Additionally, the standalone version could not be accessed at the time.

Lastly, we attempted to use both versions of trRosetta program: server and standalone. We found 3DFI [9] Perl scripts created to run the program and generate structure files in PDB format when the input sequence in FASTA is provided. We tested the program and managed to run it for a test target - 4CL protein. However, the standalone version of trRosetta did not finish the modeling of our fusion protein. The reason for this might be that the complex is simply too big, and the computer that we got access to run the program on had got little RAM (15 GB in total with 7 GB swap) with only 8 CPUs. There has also been an issue closed on Github [47] about the similar problem that trRosetta takes a long time on large MSA’s. It was explained that even running the program on a machine with GPU might throw an ‘out of memory’ error for MSA with ~600 residues. Meanwhile, our system is composed of roughly 1000 residues, and the computer did not contain any GPU; therefore, it is clear why we did not manage to run this program in our case.

RoseTTAFold modeling approach can be accessed via the Robetta server when the RoseTTAFold option is chosen. The first release of RoseTTAFold became open on Github as a standalone version of the program on 4th July; however, it requires over 400 GB of disc space. Since we did not have such resources, we could not use the standalone version of the program.

Before getting access to AlphaFold2, we initially attempted to model structures using RoseTTAFold without providing manually created multiple sequence alignment (MSA) files. In our project development, we had consultations with researchers that specialize in protein structural bioinformatics, and we were suggested to create our own MSA files. The mentioned approach should represent our fusion protein system not partially but as a whole more accurately (a more detailed description of this modeling approach can be found in the further sections). Since RoseTTAFold does not provide the MSA files that it generates by default, only an assumption can be made that these files might not accurately represent the region of the sequence that is located near the linker. Therefore, we wrote scripts to generate MSA files with template sequences for our fusion proteins connected with our chosen linker.

Scoring modeled complexes

The model of the complex had to be chosen according to its evaluation scores. We had a few consultations about the protein complex modeling and each of the consulting specialists of this field emphasized that we should apply at least several scoring systems, hence the best quality model would be chosen.

VoroMQA

VoroMQA [10] - the method is based on partitioning space into Voronoi cells. Two adjacent Voronoi cells share a set of points that form a surface called a Voronoi face, which can be viewed as a geometric representation of contact between two atoms. VoroMQA determines the quality of protein structural models using inter-atomic and solvent contact areas, including the knowledge-based statistical potential.

ProSA

ProSA [11] is a program that uses knowledge-based potentials of mean force to evaluate the accuracy of the input model. The structure’s energy is evaluated using a distance-based pair potential and a potential that captures the solvent exposure of protein residues. From these energies, two characteristics are derived: a z-score and a plot of its residue energies.

The z-score indicates overall model quality and measures the structure’s total energy deviation concerning an energy distribution derived from random conformations. Z-scores outside a range characteristic determined by structures of native proteins indicate errors and low quality of the model that is being evaluated.

TM-Score

TM-Score [12] is a template modeling score is a variation of the Levitt-Gerstein (LG) score, which weights shorter distances between corresponding residues more strongly than longer distances.

TM-align

TM-align [46] is an algorithm that employs TM-Score and dynamic programming to generate the best structural alignment between two protein structures.

CAD-Score

The definition of CAD-score [13] is based on the three considerations:

  1. contacts in the model should be evaluated according to the contacts in the reference structure (target);
  2. any missing residues in the model should be treated in the same way as if none of their contacts were correctly predicted;
  3. strong over-prediction (nonphysical overlap) of a particular contact should be equivalent to missing that contact entirely (in other words, there should be no false-positive predictions in terms of contact).

QMEAN

QMEAN [48] is an acronym for Qualitative Model Energy ANalysis. There are five different descriptors used for this evaluation approach, which are:

  1. torsion angle potential;
  2. distance-dependent pairwise residue potential;
  3. solvation potential;
  4. the term describing agreement of predicted and calculated secondary structure;
  5. the term describing solvent accessibility.

QMEANDisCo

In addition to QMEAN composite scoring, QMEANDisCo [14] introduces a new distance constraint (DisCo) that evaluates the agreement of pairwise distances and constraints taken from homologous experimentally determined protein structures.

ProQ3D

ProQ3D [15] is a single-model quality estimation program. ProQ estimated the quality of the whole model using a machine learning approach - support-vector machine - by summing up the predicted qualities for each residue. ProQ2 included profile weights to improve the predictions. ProQ3 included energy terms calculated from Rosetta.

Overall, ProQ3D uses the same inputs as in ProQ3. Although, a neural network is used to assess the quality of the protein structure instead of a support vector machine.

Ramachandran plot analysis

Protein structural scientists use Ramachandran plots to gain a better insight into the secondary structure of peptides. It represents the combinations of φ and ᴪ dihedral angles. Most pairs of φ and ᴪ are sterically forbidden since they introduce interatomic clashes. Outlines mark allowed regions in the graph. The plots were generated using RamachanDraw library. We thought it would be helpful to add the customization functionality; therefore, we wrote a small software tool RamachanDrawX that allows us to plot Ramachandran plots more efficiently and customize them for the desired design.

Structure refinement using molecular dynamics

After modeling and evaluating structures with several scoring functions, there is usually a need to refine structures after modeling in silico to avoid potential clashes between atoms and make the design more stable energetically. However, our protein complexes are rather big; they are composed of 962 - 974 residues ( 173 610 - 329 648 atoms without solvate particles). After consulting with a molecular dynamics specialist working with protein structures, we found out that at least 100 nanoseconds of simulations would be needed for our fusion protein system, which is a computationally costly procedure.

We chose GROMACS program to perform molecular dynamics simulations since it is freely available, versatile, and has an extensive documentation. Our initial choice of molecular dynamics protocol was based on a tutorial published in Bioinformatics Review [23]. As mentioned, we organized a consultation with a bioinformatics scientist who specializes in molecular dynamics. After the meeting, we had all the answers to the questions raised related to this part of modeling and received many suggestions on how to get the most out of molecular dynamics simulation. Generally, after the consultation, we adjusted our molecular dynamics protocol to be the one described in the table below. Explanations for each of the used commands can be found in the Perl script written for this molecular dynamics flow.

Table 3. Adjustments to the molecular dynamics simulation steps
Before consultation After consultation Command
Preparing protein file by removing other chains and heterogenous atoms `grep`
Converting protein's PDB file to Gromacs format GRO `pdb2gmx`
Choosing force-field CHARMM27 Choosing force-field AMBER94
Choosing water model SPC Choosing water model TIP3
Defining a box with 2 Å distance between protein each wall of the dodecahedron type box Defining a box with 1.2 Å distance between protein and each wall of the dodecahedron type box `editconf`
Solvating protein `solvate`
Adding ions to neutralize the system, selecting group SOL Adding ions to neutralize the system `grompp`, `genion`
Adding ions to simulate water with dissolved salt (concentration 150 mM)
Energy minimization step to reach negative potential energy and maximum force to be less than 1000 kJ/mol/nm `grompp`, `mdrun`, `energy`
Temperature equilibration for 100 ps, 300 K `grompp`,`mdrun` `energy`
Pressure equilibration for 100 ps, 300 K `grompp`,`mdrun` `energy`
Production run for 1 ns Production run for 100 ns grompp`, `mdrun`

MARTINI coarse-grained simulations

Since we want to calculate the distance between active sites of our linked proteins, the cases with rigid linkers are relatively trivial. Although, we had to dig deeper about the approach to calculate distance between active sites in a flexible linker case. The initial idea was to find the maximum distance between active centers in the flexible linker case. This approach requires all possible conformations to find the one that provides the maximum distance between proteins. On the other hand, there might be particular conformations that are more likely than others. In any case, molecular dynamics simulations were needed to perform to catch the shifts in protein conformations.

However, since molecular dynamics simulations need to be performed on the large protein system consisting of roughly 1000 residues, the atomic molecular dynamics might last around a week on an 80 core system. Therefore bioinformatician V. Kairys suggested using MARTINI scripts and performing coarse-grained simulations. This kind of approach is less computationally expensive. The third version of MARTINI scripts is released, although we opted for the second version since it has more tutorials. Additionally, since Python3 is more commonly used nowadays, we chose martinize.py script adjusted for this Python version.

Modeling fusion system components

4CL

Since we take 4CL protein from Glycine max organism, we had to model the structure for our protein. Initially, we attempted to do this by using two programs: SWISS-MODEL [3] and RoseTTAFold. When AlphaFold2 [35] became freely available in July, we decided to model our chosen 4CL protein using this prominent tool. For this purpose, we used Google Colab notebook [36], in which AlphaFold2 was implemented, and we received three models in unrelaxed and relaxed states. According to the documentation found in DeepMind’s AlphaFold2 repository, the relaxed state contains a structure made after Amber relaxation procedure made on an unrelaxed structure - model provided right after modeling with AlphaFold2. It was not stated whether the average plDDT score provided as output is calculated for the unrelaxed or relaxed state. However, the models representing relaxed states were generally evaluated better.

Table 4. Evaluation of 4CL modeled structures
Model trRosetta QMEAN4 QMEANDisCo VoroMQA score
4CL.sqit.pdb -0.94 0.67 0.478739
Model SWISS-MODEL QMEAN4 QMEANDisCo VoroMQA score
4CL_1.pdb -0.73 0.84 0.541558
4CL_2.pdb -0.61 0.82 0.519713
4CL_3.pdb -0.34 0.85 0.549002
Model RoseTTAFold QMEAN4 QMEANDisCo VoroMQA score
MODEL_1_4CL2_SOYBN.pdb 1.71 0.77 0.540512
MODEL_5_4CL2_SOYBN.pdb 1.44 0.77 0.540305
MODEL_2_4CL2_SOYBN.pdb 1.6 0.76 0.539742
MODEL_3_4CL2_SOYBN.pdb 1.61 0.77 0.533458
MODEL_4_4CL2_SOYBN.pdb 1.79 0.76 0.530254
Table 5. Evaluation of 4CL modeled using RoseTTAFold with MSA structures
Model QMEAN4 QMEANDisCo VoroMQA score
MODEL_1_4CL_MSA.pdb 1.88 0.77 0.541581
MODEL_2_4CL_MSA.pdb 1.12 0.75 0.526281
MODEL_3_4CL_MSA.pdb 1.09 0.76 0.529501
MODEL_4_4CL_MSA.pdb 0.95 0.76 0.534835
MODEL_5_4CL_MSA.pdb 1.41 0.76 0.529382
Table 6. Evaluation of 4CL modeled using AlphaFold2
Model QMEAN4 QMEANDisCo VoroMQA score Average plDDT
4CL_unrelaxed_model_1.pdb -0.62 0.79 0.546974 89.651979
4CL_unrelaxed_model_2.pdb -1.01 0.79 0.549646 89.276523
4CL_unrelaxed_model_3.pdb -1.02 0.79 0.546767 88.780811
4CL_relaxed_model_1.pdb 0.13 0.8 0.551421 -
4CL_relaxed_model_2.pdb 0.3 0.8 0.552571 -
4CL_relaxed_model_3.pdb -0.86 0.8 0.549808 -
Fig. 6. Coverage and predicted lDDT per position of 4CL structure modeled with AlphaFold2
Fig. 7. Predicted alignment error of 4CL structure modeled with AlphaFold2

The model generated with AlphaFold2, when overlaid with homology-based modeling created structure, showed a structural mismatch at the C terminal from 453rd residue until the last one (562nd) in AlphaFold2 model; meanwhile, the design of SWISS-MODEL did not have the complete sequence modeled.

Fig. 8. 4CL structures modeled with SWISS-MODEL (left) and AlphaFold2 (right). The darker shade of blue indicates higher structure determination quality (B-factor spectrum)
Fig. 9. Disordered region of 4CL structure modeled with AlphaFold2 (electric) overlay with SWISS-MODEL model (pink)

CHS

As mentioned in the description of the initial modeling approach, chalcone synthase (CHS) from A. thaliana has got its structure determined experimentally. However, after reading the CASP14 analysis about refinement category [25, 26], which analyzed the accuracy of AlphaFold2 models. The study pointed out that AlphaFold2 results might indicate crystal lattice contacts - the locations in the experimentally determined structure that are not as accurate as the rest of the protein. Therefore, it was intriguing to compare CHS’ experimental structure with its AlphaFold2 model and check whether the crystal lattice contacts of the experimental structure can be observed in our case.

Table 7. Evaluation of CHS modeled using AlphaFold2 [34]
Model QMEAN4 QMEANDisCo VoroMQA score Average plDDT
CHS_unrelaxed_model_1.pdb 0.08 0.85 0.535824 96.86796
CHS_unrelaxed_model_2.pdb -0.13 0.85 0.542367 96.62286
CHS_unrelaxed_model_3.pdb -0.45 0.84 0.536282 96.34417
CHS_relaxed_model_1.pdb 0.2 0.86 0.542643 -
CHS_relaxed_model_2.pdb 0.21 0.86 0.539895 -
CHS_relaxed_model_3.pdb 0.2 0.86 0.540226 -

For modeling to be optimal, AlphaFold2 should have at least 30 sequences coverage at each position. Meanwhile, plDDT measures how well the environment in the reference structure is represented in the modeled protein structure. Apart from the first positions in the N terminal, predicted lDDT was highly evaluated in each position of CHS protein.

Fig. 10. Coverage and predicted lDDT per position of CHS structure modelled with AlphaFold2
Fig. 11. Predicted alignment error of CHS structure modelled with AlphaFold2

The experimental structure did not have the design of the first six residues. This might be because the N terminal part of the sequence was considered to be a disordered region, just as it was with the C terminal of 4CL protein in the case of PDB: 3TSY. According to the PyMOL [50] B-factor [41] coloring of PDB: 6DXB structure, the less accurate residues were found in positions: 7-8, 67, 83-84, 208-209, 271-276, 295-299, 327-328 (Table 8). Meanwhile, the B-factor coloring showed that only the first eight in the N terminal and residues in positions 210-212 were inaccurate. Since the crystal packing can distort local protein structure, it can be assumed that there were crystal lattice contacts in the places of inaccurately represented residues. These residues were observed in the experimental structure and not in the model generated with AlphaFold2.

Table 8. Inaccurate residues according to PyMOL B-factor coloring
Experimental structure Modeled structure
7-8 1-8
67 -
83-84 -
208-209 210-212
271-276 -
295-299 -
327-328 -
Fig. 12. B-factor coloring for CHS experimental structure
Fig. 13. B-factor coloring for CHS structure modeled with AlphaFold2

Visually observed differences between experimental structure and the best model from AlphaFold2 were seen in residues: 235-237 (partial alpha helix was not obtained in the model), 261-263, 269-273 (beta-sheets were introduced in the model), 383-385 (the beta-sheet was slightly bent in the model).

Fig. 14. Structural differences between experimental CHS structure and the best model of AlphaFold2

Initial fusion system modeling

We attempted to model fusion protein systems with seven different linkers: GSG, EAAAK, (EAAAK)2, (EAAAK)3, GGGGS, (GGGGS)2, and (GGGGS)3. Each structure was modeled using two ab initio protein modeling tools: trRosetta and RoseTTAFold. The structures were evaluated using VoroMQA scoring function. ProSA z-scores are not provided for all structures since the web-server was temporarily unavailable when we employed this modeling approach. The results are represented in the tables below. All structures were analyzed visually with PyMOL visualization program. The modeling results were not satisfactory due to the large portion of 4CL being uncoiled in the fusion protein model. The model for each linker case was chosen according to VoroMQA score.

We attempted to use Yasara [45] program to perform energy minimization. The output of Yasara did introduce a slight structural shift, which was observed with PyMOL command `cealign` when the minimized model was compared with the original 4CL:GSG:CHS model. However, the VoroMQA score of the original complex was higher (0.516429) than of the one that was generated by Yasara program (0.512734). Since the structural change was not significant, we did not apply this program to other structures.

Fig. 15. Structural alignment using PyMOL of 4CL:GSG:CHS structures before (electric) and after (pink) energy minimization with Yasara

Besides Yasara, we also attempted to do structural refinement using molecular dynamics simulations. Since molecular dynamics is a computationally expensive procedure, we tried performing a trivial version of simulations on our target system. In order to test this technique, we used steps described in the section about the molecular dynamics approach before consulting the specialist of the field.

The results of the modeling and refinement flow for 4CL:GSG:CHS case are presented in the supplementary data file.

Modeling strategy with MSA

After consulting specialists in the protein modeling field and receiving fresh ideas about improving our modeling workflow, we were suggested including multiple sequence alignment (MSA) as an additional input into our chosen modeling programs. Nonetheless, both trRosetta and RoseTTAFold generate MSA internally, yet these MSA files do not represent the linked protein case accurately enough because the output models still had a large portion of the disordered region between the linked proteins. Therefore, we were encouraged to use sequence-search methods to create MSA files for our modeling input ourselves.

Bioinformaticians from Stockholm University presented PconsDock approach [6], showing that the fold-and-dock method can model a pair of proteins linked via poly-G between two chains. This technique resembled our problem, thus we attempted to include into the constructed MSA files not only the exact linker sequence yet also the prolonged with glycines on both sides version of it. We wrote scripts to generate multiple sequence alignment files adjusted to our system of fused proteins. The script that produces MSA files was written in Perl. This script takes as input two a3m files, a linker sequence, and a boolean option whether to extend the linker sequence or not. The variation to extend linker sequence with poly-G was included in the script’s functionality, although this modeling method could not be applied practically; the extended linker system was not intended to be checked in the wet-lab experimentally. Therefore we continued this modeling flow only with the exact linker sequence included in the MSA. Full query-template a3m MSA files were obtained from HHblits [40] server and taken as an input to our script. The script reads the MSA file of the first protein and saves taxonomic IDs of the species to which the sequences belong. Afterward, the script reads the MSA file of the second protein, checks whether the taxonomic ID of the sequence has a match in the first MSA file, and if the match is found, the sequences are merged via the linker sequence. The final multiple sequence alignment file is printed on the screen of the terminal.

Results of modeling approach with MSA

The results of the modeling with MSA files included can be found in supplementary data file.

Modeling with AlphaFold2

The highly accurate protein structure prediction tool AlphaFold2 evaluated in CASP14 was published in Nature to be freely open to the public on 15th July 2021. Since the program requires extensive computational resources, we could not install it and run it locally. However, Kliment Olechnovič, the bioinformatician from Vilnius University, suggested that we model our protein complexes using a Google Colab notebook which has AlphaFold2 adjusted to protein complexes. According to the CASP14 evaluation of model refinement category [25, 26], AlphaFold2 generated structures require no structure refinement in most cases. During the modeling software assessment, the errors observed in the modeled structures were mainly found at crystal lattice contacts, where the experimentally determined structure is likely to be unrepresentative of biological conformations. Therefore molecular dynamics simulations for AlphaFold2 modeled structure refinement are unnecessary.

The modeling with AlphaFold2 was performed using its generated MSA files with MMseq2. The composed MSA files covered each part of the fusion protein system by at least 500 sequences, which is a good result because the best modeling results are achieved when there is coverage of 30 (ideally 100) sequences at each position (based on the notes in Google Colab notebook ).

AlphaFold2 has five parameter sets for modeling, thus in most cases, it can produce five distinct modeling variants. Although our fusion protein system is composed of around 1000 residues that make up a rather large complex, the Colab version of AlphaFold2 gave us two unrelaxed models as an output for each linker case. In general, the quality of the models was impressive, according to plDDT scoring; the score was mostly high except for the linker region.

Predicted aligned error (PAE) gives information about the relative orientations of different parts of the modeled structure [49]. The PAE plot shows that each modeled part of the fusion protein system - 4CL, the linker, and CHS - are modeled reliably within each portion, yet their positions relative to each other are not evaluated confidently. Therefore, only an assumption can be made that rigid linkers take up the modeled orientation. However, rigid linker domains and terminals of the linked proteins might have disordered parts that significantly change the conformation. Although, this assumption had to be made because we had limited access to computational resources for molecular dynamics simulations to receive a better view of the system. The main purpose of this modeling workflow is to find distances between active sites of the fused proteins. Therefore we required molecular dynamics simulations to determine the most likely conformations of protein systems, and due to computational limitations, we restricted this idea to the systems with flexible linkers.

Ramachandran plots show whether there are no steric clashes in the structures. Apparently, the given plots show that the clashes are present in the modeled structures. The reason for this is that the models are not relaxed (Amber force-field was not applied).

Linker GSG

Table 9. Evaluation of 4CL:GSG:CHS structures modeled with AlphaFold2
Model QMEAN QMEANDisCo VoroMQA
4CL_GSG_CHS_unrelaxed_model_1.pdb -1.01 0.78 0.539915
4CL_GSG_CHS_unrelaxed_model_2.pdb -1.18 0.77 0.537669
Model ProQ2D ProQRosCenD ProQRosFAD ProQ3D
4CL_GSG_CHS_unrelaxed_model_1.pdb 0.709 0.625 0.705
4CL_GSG_CHS_unrelaxed_model_2.pdb 0.701 0.633 0.801 0.721
Fig. 16. Coverage and predicted lDDT per position of 4CL:GSG:CHS structure modeled with AlphaFold2
Fig. 17. Predicted alignment error of 4CL:GSG:CHS structure modeled with AlphaFold2
Fig. 18. Ramachandran plot of the best 4CL:GSG:CHS structure modeled with AlphaFold2
Fig. 19. Illustrating distance between 4CL:GSG:CHS active sites

Linker EAAAK

Table 10. Evaluation of 4CL:EAAAK:CHS structures modeled with AlphaFold2
Model QMEAN QMEANDisCo VoroMQA
4CL_EAAAK_CHS_unrelaxed_model_1.pdb -0.69 0.79 0.542077
4CL_EAAAK_CHS_unrelaxed_model_2.pdb -1.03 0.79 0.536476
Model ProQ2D ProQRosCenD ProQRosFAD ProQ3D
4CL_EAAAK_CHS_unrelaxed_model_1.pdb 0.708 0.624 0.794 0.710
4CL_EAAAK_CHS_unrelaxed_model_2.pdb 0.7 0.63 0.798 0.717
Fig. 20. Coverage and predicted lDDT per position of 4CL:EAAAK:CHS structure modeled with AlphaFold2
Fig. 21. Predicted alignment error of 4CL:EAAAK:CHS structure modeled with AlphaFold2
Fig. 22. Ramachandran plot of the best 4CL:EAAAK:CHS structure modeled with AlphaFold2
Fig. 23. Illustrating distance between 4CL:EAAAK:CHS active sites

Linker (EAAAK)2

Table 11. Evaluation of 4CL:(EAAAK)2:CHS structures modeled with AlphaFold2
Model QMEAN QMEANDisCo VoroMQA
4CL_EAAAK_2_CHS_unrelaxed_model_1.pdb -0.78 0.79 0.541834
4CL_EAAAK_2_CHS_unrelaxed_model_2.pdb -1.16 0.78 0.536158
Model ProQ2D ProQRosCenD ProQRosFAD ProQ3D
4CL_EAAAK_2_CHS_unrelaxed_model_1.pdb 0.69 0.638 0.81 0.718
4CL_EAAAK_2_CHS_unrelaxed_model_2.pdb 0.693 0.616 0.789 0.691
Fig. 24. Coverage and predicted lDDT per position of 4CL:(EAAAK)2:CHS structure modeled with AlphaFold2
Fig. 25. Predicted alignment error of 4CL:(EAAAK)2:CHS structure modeled with AlphaFold2
Fig. 26. Ramachandran plot of the best 4CL:(EAAAK)2:CHS structure modeled with AlphaFold2
Fig. 27. Illustrating distance between 4CL:(EAAAK)2:CHS active sites

Linker (EAAAK)3

Table 12. Evaluation of 4CL:(EAAAK)3:CHS structures modeled with AlphaFold2
Model QMEAN QMEANDisCo VoroMQA
4CL_EAAAK_3_CHS_unrelaxed_model_1.pdb -1.36 0.78 0.537969
4CL_EAAAK_3_CHS_unrelaxed_model_2.pdb -1.38 0.77 0.533997
Model ProQ2D ProQRosCenD ProQRosFAD ProQ3D
4CL_EAAAK_3_CHS_unrelaxed_model_1.pdb 0.693 0.617 0.795 0.697
4CL_EAAAK_3_CHS_unrelaxed_model_2.pdb 0.693 0.634 0.803 0.713
Fig. 28. Coverage and predicted lDDT per position of 4CL:(EAAAK)3:CHS structure modeled with AlphaFold2
Fig. 29. Predicted alignment error of 4CL:(EAAAK)3:CHS structure modeled with AlphaFold2
Fig. 30. Ramachandran plot of the best 4CL:(EAAAK)3:CHS structure modeled with AlphaFold2
Fig. 31. Illustrating distance between 4CL:(EAAAK)3:CHS active sites

Linker GGGGS

Table 13. Evaluation of 4CL:GGGGS:CHS structures modeled with AlphaFold2
Model QMEAN QMEANDisCo VoroMQA
4CL_GGGGS_CHS_unrelaxed_model_1.pdb -0.75 0.78 0.542
4CL_GGGGS_CHS_unrelaxed_model_2.pdb -1.16 0.77 0.536974
Model ProQ2D ProQRosCenD ProQRosFAD ProQ3D
4CL_GGGGS_CHS_unrelaxed_model_1.pdb 0.691 0.617 0.779 0.698
4CL_GGGGS_CHS_unrelaxed_model_2.pdb 0.691 0.640 0.809 0.725
Fig. 32. Coverage and predicted lDDT per position of 4CL:GGGGS:CHS structure modeled with AlphaFold2
Fig. 33. Predicted alignment error of 4CL:GGGGS:CHS structure modeled with AlphaFold2
Fig. 34. Ramachandran plot of the best 4CL:GGGGS:CHS structure modeled with AlphaFold2
Fig. 35. Illustrating distance between 4CL:GGGGS:CHS active sites

Linker (GGGGS)2

Table 14. Evaluation of 4CL:(GGGGS)2:CHS structures modeled with AlphaFold2
Model QMEAN QMEANDisCo VoroMQA
4CL_GGGGS_2_CHS_unrelaxed_model_1.pdb -1.02 0.78 0.540667
4CL_GGGGS_2_CHS_unrelaxed_model_2.pdb -0.9 0.78 0.53278
Model ProQ2D ProQRosCenD ProQRosFAD ProQ3D
4CL_GGGGS_2_CHS_unrelaxed_model_1.pdb 0.697 0.635 0.807 0.716
4CL_GGGGS_2_CHS_unrelaxed_model_2.pdb 0.697 0.629 0.802 0.725
Fig. 36. Coverage and predicted lDDT per position of 4CL:(GGGGS)2:CHS structure modeled with AlphaFold2
Fig. 37. Predicted alignment error of 4CL:(GGGGS)2:CHS structure modeled with AlphaFold2
Fig. 38. Ramachandran plot of the best 4CL:(GGGGS)2:CHS structure modeled with AlphaFold2
Fig. 39. Illustrating distance between 4CL:(GGGGS)2:CHS active sites

Linker (GGGGS)3

Table 15. Evaluation of 4CL:(GGGGS)3:CHS structures modeled with AlphaFold2
Model QMEAN QMEANDisCo VoroMQA
4CL_GGGGS_3_CHS_unrelaxed_model_1.pdb -1.25 0.78 0.540568
4CL_GGGGS_3_CHS_unrelaxed_model_2.pdb -1.36 0.77 0.531981
Model ProQ2D ProQRosCenD ProQRosFAD ProQ3D
4CL_GGGGS_3_CHS_unrelaxed_model_1.pdb 0.696 0.627 0.801 0.707
4CL_GGGGS_3_CHS_unrelaxed_model_2.pdb 0.687 0.626 0.804 0.719
Fig. 40. Coverage and predicted lDDT per position of 4CL:(GGGGS)3:CHS structure modeled with AlphaFold2
Fig. 41. Predicted alignment error of 4CL:(GGGGS)3:CHS structure modeled with AlphaFold2
Fig. 42. Ramachandran plot of the best 4CL:(GGGGS)3:CHS structure modeled with AlphaFold2
Fig. 43. Illustrating distance between 4CL:(GGGGS)3:CHS active sites

Conclusion

It is possible that molecular dynamics simulations might bring a more accurate view about distances between protein active sites, since MD simulations take into account movements of flexible parts in the fusion complex. However, the complexes are large structures, thus they would require powerful computational resources.

We compared the distances that we captured in our modeling flow and Kcat/Km values of 4CL and STS complex analyzing the respective flexible linker cases from literature [39]. In the case of 4CL linked with STS, Kcat/Km values decrease for the linkers ordered as GSG, GGGGS, (GGGGS)2, (GGGGS)3. If we continued with the hypothesis that was set in the beginning, the tendency of increasing distances for linkers ordered in the mentioned way cannot be observed in Table 16. This comparison does not bring a reliable insight, since the modeled complex does not match the one in the literature. What is more, the rigid linker cases are not included.

Unfortunately, naringenin was not produced in our wet lab, therefore we decided not to run molecular dynamics simulations. Additionally, the initially planned correlation analysis of distances between protein active sites and the production of naringenin measure could not be made.

Table 16. Summary of distance between active sites in the models
Linker Distance between active sites (Å)
GSG 40.1
GGGGS 38.8
(GGGGS)2 39.4
(GGGGS)3 39.2
EAAAK 42.7
(EAAAK)2 42.7
(EAAAK)3 47.6

Alternative modeling approaches

Reverse linking GSG

After doing a visual analysis of the initially modeled complexes, we saw that the C terminal of 4CL structure in a fusion protein is unfurled in all cases of linkers; therefore, we started to look for ways to improve our modeling approach. One of the ideas was to perform reverse linking and check how the target domain behaves when the proteins are linked in the reverse order. Although the results showed an even more disrupted structure of linked proteins, therefore we rejected this approach. It can be seen that the linker is found in the modeled domain of CHS structure. Additionally, both structures CHS and 4CL are not structurally aligned to their native structures.

The results of the reverse linking for 4CL:GSG:CHS case are presented in the supplementary data file.

Separate modeling

The idea of the separate structural modeling was to model fusion proteins with a chosen rigid linker separately and align the distinct structures afterward. The naive expectation was to get modeled sequences aligned via a portion of the linker sequence. We decided to test this technique with the trivial case of rigid linker - EAAAK. However, due to the short length of sequence that was intended to be aligned, PyMOL did not perform `cealign` command successfully.

The results of the separate modeling for 4CL:GSG:CHS case are presented in the supplementary data file.

Modeling linkers as unknown amino acids

28th July S. Ovchinnikov shared that AlphaFold2 is capable of modeling structures with unknown amino acids (labeled as ‘U’). Therefore we decided to give this feature a try and model structures with poly-U chains as linkers with N equal to 3, 5, 10, and 15. Although it was noted that the relaxation procedure is unavailable when unknown amino acids are in the structure.

The results of modeling with unknown amino acids in the place of linkers show that linkers of 3 and 5 amino acids provide protein orientation that is similar to the one modeled with known amino acids. Meanwhile, linkers of 10 and 15 unknown amino acids doom the higher degree of freedom to the orientation of the complex.

Scores of the models with unknown aminoacids can be found in supplementary data

Fig. 44. Fusion protein structures modeled with 3 (upper left), 5 (upper right), 10 (lower left), 15 (lower right) unknown amino acids, 4CL (pink), CHS (dark green) with their distances

Modeling strategy with docking

Another approach that was suggested is to try applying protein docking. For this approach, steps would be to use protein-protein docking software, generate a significant number of conformations, filter conformations based on the distance between linked ends according to the length of the linker that the structure is modeled for, pick the filtered conformations, evaluate quality with previously used scoring functions. This approach was not applied due to the shortage of time.

References

1.
Shamriz, S., & Ofoghi, H. (2016). Design, structure prediction and molecular dynamics simulation of a fusion construct containing malaria pre-erythrocytic vaccine candidate, Pf CelTOS, and human interleukin 2 as adjuvant. BMC bioinformatics, 17(1), 1-15. To the article.
2.
Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., ... & Baker, D. (2021). Accurate prediction of protein structures and interactions using a 3-track network. bioRxiv. To the article. To the tool.
3.
Waterhouse, A., Bertoni, M., Bienert, S., Studer, G., Tauriello, G., Gumienny, R., ... & Schwede, T. (2018). SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic acids research, 46(W1), W296-W303. To the article.
4.
Lu, P., & Feng, M. G. (2008). Bifunctional enhancement of a β-glucanase-xylanase fusion enzyme by optimization of peptide linkers. Applied microbiology and biotechnology, 79(4), 579-587. To the article.
5.
Wiederstein, M., & Sippl, M. J. (2007). ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic acids research, 35(suppl_2), W407-W410. To the article.
6.
Pozzati, G., Zhu, W., Bassot, C., Lamb, J., Kundrotas, P., & Elofsson, A. (2021). Limits and potential of combined folding and docking using PconsDock. bioRxiv. To the article.
7.
Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., & Zhang, Y. (2015). The I-TASSER Suite: protein structure and function prediction. Nature methods, 12(1), 7-8. To the article. To the tool.
8.
Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., & Baker, D. (2020). Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences, 117(3), 1496-1503. To the article.
9.
3DFI scripts to run trRosetta To the tool.
10.
Olechnovič, K., & Venclovas, Č. (2019). VoroMQA web server for assessing three-dimensional structures of proteins and protein complexes. Nucleic acids research, 47(W1), W437-W442. To the article.
11.
Li, G., Huang, Z., Zhang, C., Dong, B. J., Guo, R. H., Yue, H. W., ... & Xing, X. H. (2016). Construction of a linker library with widely controllable flexibility for fusion protein design. Applied microbiology and biotechnology, 100(1), 215-225. To the article.
12.
Xu, J., & Zhang, Y. (2010). How significant is a protein structure similarity with TM-score= 0.5?. Bioinformatics, 26(7), 889-895. To the article.
13.
Olechnovič, K., & Venclovas, Č. (2020). Contact area-based structural analysis of proteins and their complexes using CAD-score. In Structural Bioinformatics (pp. 75-90). Humana, New York, NY. To the article.
14.
Studer, G., Rempfer, C., Waterhouse, A. M., Gumienny, R., Haas, J., & Schwede, T. (2020). QMEANDisCo—distance constraints applied on model quality estimation. Bioinformatics, 36(6), 1765-1771. To the article.
15.
Uziela, K., Menendez Hurtado, D., Shu, N., Wallner, B., & Elofsson, A. (2017). ProQ3D: improved model quality assessments using deep learning. Bioinformatics, 33(10), 1578-1580. To the article.
16.
Fusion protein modelling, Profacgen. To the source.
17.
Fusion protein modelling, Parics Saclay iGEM 2014. To the Wiki.
18.
Dry Lab, Modelling, IIT Roorkee iGEM 2020. To the Wiki.
19.
Protein structure prediction, Western Canada iGEM 2019. To the Wiki.
20.
Structural model, IISER Bhopai iGEM 2020. To the Wiki.
21.
Protein modelling, Toronto iGEM 2017. To the Wiki.
22.
Structural modelling, Warsaw iGEM 2009. To the Wiki.
23.
Dr. Muniba Faiza, Tutorial: Molecular dynamics (MD) simulation using Gromacs, 2019. To the source.
24.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 1-11. To the article.
25.
Simpkin, A. J., Rodríguez, F. S., Mesdaghi, S., Kryshtafovych, A., & Rigden, D. J. (2021). Evaluation of model refinement in CASP14. Proteins: Structure, Function, and Bioinformatics. To the article.
26.
Heo, L., Janson, G., & Feig, M. Physics‐Based Protein Structure Refinement in the Era of Artificial Intelligence. Proteins: Structure, Function, and Bioinformatics. To the article.
27.
Pesce, F., & Lindorff-Larsen, K. (2021). Refining conformational ensembles of flexible proteins against small-angle X-ray scattering data. bioRxiv. To the article.
28.
Collu, G., Bierig, T., Krebs, A. S., Engilberge, S., Varma, N., Guixa-Gonzalez, R., ... & Benoit, R. M. (2020). Chimeric single α-helical domains as rigid fusion protein connections for protein nanotechnology and structural biology. bioRxiv. To the article.
29.
Jeong, W. H., Lee, H., Song, D. H., Eom, J. H., Kim, S. C., Lee, H. S., ... & Lee, J. O. (2016). Connecting two proteins using a fusion alpha helix stabilized by a chemical cross linker. Nature communications, 7(1), 1-9. To the article.
30.
Chen, X., Zaro, J. L., & Shen, W. C. (2013). Fusion protein linkers: property, design and functionality. Advanced drug delivery reviews, 65(10), 1357-1369. To the article.
31.
Peptide Hydrophobicity/Hydrophilicity Analysis, Peptide2.0. To the source.
32.
PeptideCutter, Expasy, SIB Swiss Institute of Bioinformatics. To the tool.
33.
PEP-FOLD3. To the tool.
34.
Dunstan, M. S., Robinson, C. J., Jervis, A. J., Yan, C., Carbonell, P., Hollywood, K. A., ... & Scrutton, N. S. (2020). Engineering Escherichia coli towards de novo production of gatekeeper (2 S)-flavanones: naringenin, pinocembrin, eriodictyol and homoeriodictyol. Synthetic Biology, 5(1), ysaa012. To the article.
35.
AlphaFold2, DeepMind, GitHub. To the tool.
36.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 1-11. To the article.
37.
Sergey Ovchinnikov, Milot Mirdita, & Martin Steinegger. (2021, July 22). ColabFold - Making Protein folding accessible to all via Google Colab (Version v1.0-alpha). Zenodo. To the tool.
38.
Reddy Chichili, V. P., Kumar, V., & Sivaraman, J. (2013). Linkers in the structural biology of protein–protein interactions. Protein Science, 22(2), 153-167. To the article.
39.
Guo, H., Yang, Y., Xue, F., Zhang, H., Huang, T., Liu, W., ... & Ma, L. (2017). Effect of flexible linker length on the activity of fusion protein 4-coumaroyl-CoA ligase:: stilbene synthase. Molecular BioSystems, 13(3), 598-606. To the article.
40.
Zimmermann, L., Stephens, A., Nam, S. Z., Rau, D., Kübler, J., Lozajic, M., ... & Alva, V. (2018). A completely reimplemented MPI bioinformatics toolkit with a new HHpred server at its core. Journal of molecular biology, 430(15), 2237-2243. S0022-2836(17)30587-9 To the article.
41.
Johnson, T. W., Gallego, R. A., Brooun, A., Gehlhaar, D., & McTigue, M. (2018). Reviving B-Factors: Retrospective Normalized B-Factor Analysis of c-ros Oncogene 1 Receptor Tyrosine Kinase and Anaplastic Lymphoma Kinase L1196M with Crizotinib and Lorlatinib. ACS medicinal chemistry letters, 9(9), 878–883. To the article.
42.
Mariani, V., Biasini, M., Barbato, A., & Schwede, T. (2013). lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 29(21), 2722-2728. To the article.
43.
Hiranuma, N., Park, H., Baek, M., Anishchenko, I., Dauparas, J., & Baker, D. (2021). Improved protein structure refinement guided by deep learning based accuracy estimation. Nature communications, 12(1), 1-11. To the article.
44.
Zhou, W., Šmidlehner, T., & Jerala, R. (2020). Synthetic biology principles for the design of protein with novel structures and functions. FEBS letters, 594(14), 2199-2212. To the article.
45.
Land H., Humble M.S. (2018) YASARA: A Tool to Obtain Structural Guidance in Biocatalytic Investigations. In: Bornscheuer U., Höhne M. (eds) Protein Engineering. Methods in Molecular Biology, vol 1685. Humana Press, New York, NY. To the article.
46.
Zhang, Y., & Skolnick, J. (2005). TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic acids research, 33(7), 2302-2309. To the article.
47.
trRosetta GPU issue To the source.
48.
Benkert, P., Tosatto, S. C. E., & Schomburg, D. (2008). QMEAN: A comprehensive scoring function for model quality assessment. Proteins: Structure, Function, and Bioinformatics, 71(1), 261–277. doi:10.1002/prot.21715 To the article.
49.
Sameer Velankar, Gerard Kleywegt, Alex Bateman (2021). AlphaFold Protein Structure Predictions - a Step Change for Biology. EMBL’s European Bioinformatics Institute (EMBL-EBI). To the article.
50.
DeLano, W. L. (2002). Pymol: An open-source molecular graphics tool. CCP4 Newsletter on protein crystallography, 40(1), 82-92. To the article.