Why are we doing this?
With lab access limited by the pandemic, our team wanted to make the most of our time in the lab by reducing the trial and error required to produce a rare earth element detection system. Although inspired by the literature, our novel fusion proteins needed to be designed deliberately, with a clear rationale for each choice. When we first faced the question of which linkers should join the parts of the fusion protein, we therefore set out to develop a comprehensive workflow to evaluate and justify the linkers we used, with the aim of increasing our chances of building a successful measurement system in the limited time we had. Following the engineering cycle, this documentation walks through the revisions and improvements we made to the workflow so that it is versatile and applicable to future iGEM teams who want to create their own novel fusion proteins.
Starting on the NanoLuc Luciferase system:
Through our preliminary research, we quickly learned that the best linkers would be those that bring the functional motifs close to each other. To maximize the luminescence intensity of the luciferase system, we needed to minimize the distance between the functional motifs [1].
What linkers did we choose?
Once the aim was clear, we compiled a list of potential linkers based on their properties, design and functionality. Additionally, because we believed that the size of the linker would make the most significant difference, we made sure the linkers varied in length between 5 and 15 amino acids.
The power of modelling
Since lanmodulin is attached to both a large bit and a small bit luciferase, each fusion protein needs two linkers, so we had to compare combinations of linkers to find the best pair. Using the list of potential linkers stored in a CSV file, we wrote a script that inserts a linker between the terminal ends of adjacent parts. The outcome was a list of 25 unique fusion protein sequences from 5 potential linkers (5 × 5 combinations).
Given the list of sequences, we now needed to characterize and visualize each structure. In principle, this could be done experimentally by ordering each sequence, producing the protein, and running nuclear magnetic resonance (NMR) on each one to obtain a three-dimensional structure. Because this would be extremely resource- and time-intensive, we decided to implement a comprehensive workflow for structure prediction.
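A minimal sketch of how such a script might enumerate the combinations. It assumes the order large bit, linker, lanmodulin, linker, small bit, which the text does not state explicitly. The linker names, linker sequences and short part sequences below are placeholders for illustration, not the team's actual parts.

```python
import csv
import io
from itertools import product

# Hypothetical linker table; in practice this would be read from a CSV file on disk.
LINKER_CSV = """name,sequence
L1,GGGGS
L2,EAAAK
L3,GGGGSGGGGS
L4,EAAAKEAAAK
L5,GGGGSGGGGSGGGGS
"""

# Placeholder part sequences (NOT the real large bit, lanmodulin or small bit).
LARGE_BIT = "MKLVT"
LANMODULIN = "DPEKD"
SMALL_BIT = "VSGWR"


def read_linkers(handle):
    """Return {name: sequence} from a CSV with 'name' and 'sequence' columns."""
    return {row["name"]: row["sequence"] for row in csv.DictReader(handle)}


def build_fusions(linkers, first, middle, last):
    """Build every first + linker A + middle + linker B + last sequence."""
    fusions = {}
    for (name_a, seq_a), (name_b, seq_b) in product(linkers.items(), repeat=2):
        fusions[f"{name_a}_{name_b}"] = first + seq_a + middle + seq_b + last
    return fusions


linkers = read_linkers(io.StringIO(LINKER_CSV))
fusions = build_fusions(linkers, LARGE_BIT, LANMODULIN, SMALL_BIT)
print(len(fusions))  # 25
```

Keeping the parts as function arguments, rather than hard-coding them, means the same function could be reused for a different set of parts and linkers.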
Introducing Homology
Homology, or comparative, modelling uses parts of experimentally verified protein structures as templates to estimate how a new protein would look. To begin, we used the Basic Local Alignment Search Tool (BLAST), a tool from the National Center for Biotechnology Information (NCBI) that returns known proteins with regions similar to the query sequence [2]. To reduce overlap and conflicting structures, we chose the combination of proteins that gave us the most coverage with the fewest templates. We also wanted to be as consistent as possible in the structures used for each fusion protein, since our goal was to compare the linkers, not the parts. For this reason, we had three default proteins: one to cover lanmodulin (6MI5), one for the small bit and one for the large bit luciferase. The fourth and potentially fifth proteins varied based on the linkers used. Once we had chosen the proteins, we exported the collection as a FASTA alignment and uploaded it to CLUSTALW for multiple sequence alignment [3]. This aligns the chosen protein sequences so that they can be used to build a 3D structure. Finally, we used UCSF Chimera to build our three-dimensional structures [4].
zDOPE and Pessimistic/Optimistic (P/O) Models
Chimera then outputs five candidate structures for each protein, from which we chose the best and worst models based on their zDOPE scores. The normalized Discrete Optimized Protein Energy (zDOPE) is an atomic distance-dependent statistical score in which more negative values represent better models [5]. We kept both the best and the worst model to reduce bias in our estimates and to obtain an estimated range instead of a single number. These two models acted as our optimistic and pessimistic choices for each set of linkers, allowing us to average the two and get a more reliable value for the spacing we could expect in our fusion protein.
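The selection and averaging described above can be sketched as follows. The `Model` class, the model names and all numeric values are made up for illustration. The only rule taken from the text is that a more negative zDOPE marks the better model.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    zdope: float     # normalized DOPE score; more negative is better
    distance: float  # motif-to-motif distance in angstroms (illustrative)


def optimistic_pessimistic(models):
    """Return the best (lowest zDOPE) and worst (highest zDOPE) models."""
    best = min(models, key=lambda m: m.zdope)
    worst = max(models, key=lambda m: m.zdope)
    return best, worst


def distance_estimate(models):
    """Return (low, high, mean) of the distances of the best and worst models."""
    best, worst = optimistic_pessimistic(models)
    low, high = sorted((best.distance, worst.distance))
    return low, high, (best.distance + worst.distance) / 2


# Five hypothetical models for one fusion protein.
candidates = [
    Model("model_1", -0.9, 30.0),
    Model("model_2", -0.5, 35.0),
    Model("model_3", -0.2, 42.0),
    Model("model_4", -0.7, 33.0),
    Model("model_5", -0.4, 38.0),
]
print(distance_estimate(candidates))  # (30.0, 42.0, 36.0)
```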
Linker distance finder
Once we had produced an optimistic and a pessimistic model, we wrote an automated script to measure distances in all the structures. We selected specific amino acids located in the small bit and large bit luciferase. Then, using the coordinates of each amino acid provided by the PDB file, the script builds a vector between the two and takes its length as the distance. Because these PDB files contain thousands of lines of coordinates and every line has to be read, we wrote our script in C++ for a significant run time improvement over interpreted languages like Python. Using the range of distances for each fusion protein and the three-dimensional models, we were able to evaluate and cut down the pool of potential fusion proteins.
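The measurement itself can be illustrated with a short sketch. It is written in Python for readability, whereas the team's script was in C++. It assumes each residue is represented by its alpha carbon (CA), which the text does not specify, and it reads the standard fixed-column ATOM records of the PDB format. The function names are hypothetical.

```python
import math


def ca_coordinates(pdb_lines, chain, residue_number):
    """Return (x, y, z) of the alpha carbon of one residue from PDB ATOM records."""
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()   # columns 13-16: atom name
        same_chain = line[21] == chain    # column 22: chain identifier
        same_residue = int(line[22:26]) == residue_number  # columns 23-26
        if atom_name == "CA" and same_chain and same_residue:
            # columns 31-38, 39-46 and 47-54 hold x, y and z in angstroms
            return (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    raise ValueError(f"CA atom of residue {chain}{residue_number} not found")


def motif_distance(pdb_lines, residue_a, residue_b, chain="A"):
    """Return the distance in angstroms between the CA atoms of two residues."""
    lines = list(pdb_lines)  # allow a file handle to be scanned twice
    a = ca_coordinates(lines, chain, residue_a)
    b = ca_coordinates(lines, chain, residue_b)
    return math.dist(a, b)   # length of the vector from a to b
```

For a model saved as a file, the lines would come from opening that file and passing the handle to `motif_distance` together with the two residue numbers of interest.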
Molecular Docking
For the remaining fusion proteins, we ran molecular docking simulations with neodymium ligands. The purpose of this step was to evaluate the binding affinity of each protein. This let us check whether lanmodulin was still able to bind rare earth metals while fused to the luciferase parts. It also helped us understand whether there was a correlation between the size of the linker and lanmodulin’s ability to fold.
Our Process
After preliminary research, we had two options for running our molecular docking simulations: AutoDock4 and AutoDock Vina [6]. We ran into a limitation with AutoDock Vina, as it did not allow us to add new atom parameters to the program. This is needed because the properties of neodymium and other rare earth metals are not built into the program by default. Because of this issue, we decided to use the better-characterized program, AutoDock4.
As part of the molecular docking process, we first ran each protein through AutoDockTools. This application was used to prepare the docking parameter files, which tell AutoDock4 which search algorithm to use and the extent of the calculations. AutoDock4 provides four search algorithms: simulated annealing (SA), a genetic algorithm (GA), local search (LS) and a Lamarckian genetic algorithm (LGA) [7]. Although Monte Carlo simulated annealing has the potential for efficient evaluations, both genetic methods improve on it, and LGA performs best because it is tailored to ligand-receptor docking and handles more degrees of freedom [8]. Finally, once the ligands were fitted into their designated places with the correct charge, we ran molecular docking for each protein, resulting in binding energy and root-mean-square deviation (RMSD) values similar to the run shown below:
Figure 1. Results from Molecular Docking
After running simulations for the remaining proteins, we found a slight decrease in binding energy, which we attribute to the size of the luciferase bits compared to lanmodulin. This part of our modelling pipeline also allowed us to remove candidate proteins with poor binding energy.
Molecular Dynamics
Our final evaluation tool was GROMACS version 18.02, a software package used to perform molecular dynamics (MD), which we ran on the University of Calgary ARC Computing Cluster [9]. These computer simulations analyze the physical movements of atoms and molecules, and we used them to better understand how the protein moves over time and to evaluate whether it is stable. For the remaining proteins, we ran 3-nanosecond MD simulations.
Molecular dynamics simulations were carried out by the team using the following general scheme, with commands in parentheses [10].
Convert PDB file to GRO file (gmx pdb2gmx)
Generate empty box to have 1.00 nanometers extra room around the protein (gmx editconf)
Solvate the protein with water using the spc216 water approximation (gmx solvate)
Generate ions to ensure the system has neutral charge (gmx genion)
Perform Energy Minimization (gmx mdrun with energy minimization mdp file)
Perform isothermal-isochoric equilibration (gmx mdrun with isothermal-isochoric equilibration mdp file)
Perform isothermal-isobaric Equilibration (gmx mdrun with isothermal-isobaric Equilibration mdp file)
Perform Molecular Dynamics (gmx mdrun with molecular dynamics mdp file)
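The scheme above can be written out as concrete command lines. The sketch below only builds the commands as argument lists and does not run them. All file names (`fusion.pdb`, `topol.top`, `em.mdp` and so on) are assumptions for illustration. It also reflects how GROMACS is typically driven: each mdp file is first passed to `gmx grompp` to produce a run input file, which `gmx mdrun` then executes.

```python
import shlex


def gromacs_commands(pdb_file="fusion.pdb"):
    """Return the command lines for one MD run as argument lists (nothing is executed)."""
    lines = [
        # pdb2gmx prompts for a force field and water model unless they are given as options.
        f"gmx pdb2gmx -f {pdb_file} -o protein.gro -p topol.top",
        # -c centres the protein; -d 1.0 leaves 1.00 nm between the protein and the box edge.
        "gmx editconf -f protein.gro -o boxed.gro -c -d 1.0",
        "gmx solvate -cp boxed.gro -cs spc216.gro -o solvated.gro -p topol.top",
        # genion needs a run input file, so grompp is run first; genion then prompts
        # for the group of molecules to replace with ions.
        "gmx grompp -f ions.mdp -c solvated.gro -p topol.top -o ions.tpr",
        "gmx genion -s ions.tpr -o ionized.gro -p topol.top -neutral",
    ]
    previous = "ionized"
    # Energy minimization, NVT equilibration, NPT equilibration, production MD.
    for stage in ("em", "nvt", "npt", "md"):
        lines.append(
            f"gmx grompp -f {stage}.mdp -c {previous}.gro -p topol.top -o {stage}.tpr"
        )
        lines.append(f"gmx mdrun -deffnm {stage}")
        previous = stage
    return [shlex.split(line) for line in lines]


for command in gromacs_commands():
    print(" ".join(command))
```

Options for position restraints (`-r`) and checkpoint continuation (`-t`) on the equilibration and production `grompp` calls are left out for brevity; whether they are needed depends on the mdp files used.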
Each simulation produces a PDB file every 10 picoseconds. Using the same script as before, we measured the distances between the functional motifs in each frame. These ranges let us estimate the fluctuation within the protein more accurately and verify our previous predictions. They also helped us understand how the protein moves over time and whether the structure is stable.
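One way to turn the per-frame distances into a range is sketched below. The function name and the example distances are hypothetical. The only value taken from the text is the 10 picosecond spacing between frames.

```python
from statistics import mean, pstdev

FRAME_INTERVAL_PS = 10  # one PDB frame every 10 ps, as stated in the text


def summarise_distances(distances):
    """Summarise motif-to-motif distances (angstroms) measured frame by frame."""
    return {
        "frames": len(distances),
        "duration_ps": (len(distances) - 1) * FRAME_INTERVAL_PS,
        "min": min(distances),
        "max": max(distances),
        "mean": mean(distances),
        "stdev": pstdev(distances),  # spread of the distance over the trajectory
    }


# Three made-up frames; a 3 ns run sampled every 10 ps would give about 300.
summary = summarise_distances([30.0, 32.0, 34.0])
print(summary["min"], summary["max"], summary["mean"])  # 30.0 34.0 32.0
```

A small standard deviation relative to the mean would suggest that the spacing between the motifs stays roughly constant over the simulation.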
Results
Ultimately, through the tests described above, we were able to find the best set of linkers for our luciferase fusion protein. This protein has a linker of 8 amino acids connected to the large bit and a linker of 6 amino acids connected to the small bit.
Figure 2. Our resulting Luciferase Fusion Protein
Introduction to BRET:
When running the same workflow on the BRET system, we realized that we could improve the script that creates the unique protein sequences by letting users supply their own parts and list of linkers. When we used BLAST to find proteins with similar regions, we were unable to find significant coverage, which led to very poor 3D models. A more reliable structural modelling workflow was therefore needed for more unique fusion proteins. Although many open source structure prediction services exist, many of them are inaccurate.
Robetta
We turned to Robetta, a protein structure prediction service that provides a variety of algorithms. As described earlier, comparative (homology) modelling uses existing structures to predict what a new structure would look like. Alternatively, ab initio modelling is typically recommended for protein structures that don’t have any coverage [11]. This algorithm starts from the chain of amino acids and uses a fragment assembly method to fold the chain into its tertiary structure. The third option was RoseTTAFold, a structure prediction algorithm released in July 2021. Like AlphaFold2, which preceded it, RoseTTAFold uses deep neural networks to accurately predict protein structures [12].
Finally, to evaluate the accuracy of each algorithm, we used the angstrom error estimates, which indicate how much each part of the predicted structure is expected to vary. We found RoseTTAFold to be significantly better than the other methods, as it produced the most accurate models.
Following the rest of the workflow, we were able to evaluate and identify the best linker combination for our BRET system.
Figure 3. Our resulting BRET Fusion Protein
Conclusion
Ultimately, we developed this process to make modelling an effective tool for characterizing and evaluating fusion protein structures in their early stages. Along the way, we created a generalized workflow that other teams or researchers can use as part of their pipeline when creating their own novel proteins.
In the future, we aim to make the workflow require fewer programs so that results can be produced faster, allowing for more tests. Using the open source ColabFold code, which supports both RoseTTAFold and AlphaFold2, we could develop a single program that creates all the different sequence combinations, builds 3D models, and measures distances between motifs.
Promega Team. NanoBiT® PPI Starter Systems. Promega Corporation. [accessed 2021 Oct 21]. https://www.promega.ca/products/protein-interactions/live-cell-protein-interactions/nanobit-ppi-starter-systems/?catNum=N2014
BLAST Team. Blast: Basic local alignment search tool. National Center for Biotechnology Information. [accessed 2021 Oct 21]. https://blast.ncbi.nlm.nih.gov/Blast.cgi
CLUSTALW Team. Multiple sequence alignment - CLUSTALW. GenomeNet. [accessed 2021 Oct 21]. https://www.genome.jp/tools-bin/clustalw
Chimera Team. UCSF Chimera - an Extensible Molecular Modeling System. UCSF Chimera Home Page. [accessed 2021 Oct 21]. https://www.cgl.ucsf.edu/chimera/
Chimera Team. Comparative Modeling Tutorial. Comparative modeling tutorial. [accessed 2021 Oct 21]. https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/dor.html
Morris GM. Autodock. AutoDock. 2013 Feb 27 [accessed 2021 Oct 21]. http://autodock.scripps.edu/
Morris GM. How to prepare a docking parameter file for AutoDock4. Autodock. 2007 Aug 27 [accessed 2021 Oct 21]. http://autodock.scripps.edu/faqs-help/how-to/how-to-prepare-a-docking-parameter-file-for-autodock4-1?searchterm=genetic%2Balgorithm
Morris GM, Goodsell DS, Halliday RS, Huey R, Hart WE, Belew RK, Olson AJ. Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function. Jay Ponder Lab. 1998 Jun 24 [accessed 2021 Oct 21]. https://dasher.wustl.edu/chem430/readings/jcc-19-1639-98.pdf
GROMACS Team. About GROMACS. GROMACS. [accessed 2021 Oct 21]. http://www.gromacs.org/About_Gromacs
Team Calgary 2020. ENDOGLUCANASE 1. iGEM Calgary. [accessed 2021 Oct 21]. https://2020.igem.org/Team:Calgary/Endoglucanase1
Hardin C, Pogorelov TV, Luthey-Schulten Z. Ab initio protein structure prediction. Current Opinion in Structural Biology. 2002 Apr 11 [accessed 2021 Oct 21]. https://www.sciencedirect.com/science/article/abs/pii/S0959440X02003068?casa_token=5x6dvbehIyEAAAAA%3AR7z4eQfhUq7WuovP63hMEVGbjWWp2f2KLNQ6Y6YBk4AnxTcSjCvGkW5FAsKFMnBmDZtBej4YpQ
Admin. Researchers unveil 'phenomenal' new AI for predicting protein structures. GSTDTAP. [accessed 2021 Oct 21]. http://resp.llas.ac.cn/C666/handle/2XK7JSWQ/333289
Schrödinger Team. PyMOL by Schrödinger. PyMOL. [accessed 2021 Oct 21]. https://pymol.org/2/