Team:SYSU-Software/Design - 2021.igem.org


Overview

Phoebus, named after Phoebus Apollo, is a platform that helps users control biochemical processes with light. Phoebus currently has three functional modules and a Phoebus Community.

For now, there are three modules in Phoebus, and users can upload more supplementary modules to it.

The first module is the Opto-controllable Elements Designer, the heart of Phoebus. It consists of five parts: an opto-switch protein database, a linker database, a structure prediction algorithm, a fusion protein activity evaluator, and a sequence optimizer. After uploading a target protein, users choose an opto-switch protein that responds to light, and then choose an appropriate linker to join the target protein to the opto-switch protein. They can use our structure prediction algorithm to calculate the structure of the fusion protein, and use our fusion protein activity evaluator to check whether the fusion protein retains its original activity. After this step, an opto-controllable element has been designed. As the final step, users can use our sequence optimizer to output the DNA sequence.

The core module: Opto-controllable Elements Designer

The second module is the Cascade Reaction Rate Control Module. As we mentioned in Description, it is a supplementary module that guides users in designing a system of opto-controllable elements that can quantitatively control the rate of a cascade reaction with light. When the target enzymes and the opto-switch proteins are known, we can return a [light intensity] – [reaction rate] curve to users.

The third module is the Transcription Control Module. This is another supplementary module we built to help users control gene expression with light. We built a transcription factor database into it, so users can search for TFs that bind to a target gene and build opto-controllable TFs with which to control gene expression with light.

Besides these modules, any user can update our platform and upload more supplementary modules. We don't want Phoebus to be limited to what it was when we created it; instead, it should be like a child curiously learning from the world and growing up little by little. It is not our Phoebus; it is the whole world's Phoebus.

So that users can communicate with and inspire each other, we also built the Phoebus Community, a place for sharing and Q&A.

Now let's take a closer look at all these functions of Phoebus!

Opto-Switch Protein Database

Photosensitive proteins are a class of protein molecules that change their structure when stimulated by light of a specific wavelength. They are found in a variety of organisms in nature and enable organisms to respond to changes in environmental light. Some photosensitive proteins heterodimerize with another protein after light stimulation at a specific wavelength, others homodimerize or multimerize with proteins of the same kind, and still others undergo a conformational change in the light, exposing a previously masked helical structure.

Based on these three expected functions, we searched published articles for available photosensitive proteins and their variants[1], and in the process we also found some platforms that integrate information on photosensitive proteins. According to the wavelength of light sensed and the function, our project pre-integrated three types of photosensitive proteins, namely CRY-like proteins, LOV-like proteins, and others.

Our opto-switch protein database

We collected the structures of dozens of photosensitive proteins together with their sequence information, size, excitation wavelength, mutation sites, mutation function, and references. The data are organized into a list, which makes it easy for users to select the photosensitive protein structures they need and convenient to interface with downstream functions.

SLinker: Linker Database

A linker is a sequence that joins two domains and prevents inter-domain interference. Its length and amino acid composition determine its function. Longer linkers are more hydrophilic and more exposed to the aqueous solvent than short linkers[2]. According to their secondary structure, linkers can be divided into helical and non-helical linkers[2]. Helical linkers adopt an α-helix and act as rigid spacers that separate domains, which decreases inter-domain interference. Non-helical linkers are rich in Pro and lack a fixed rigid structure.

- Linker Online Database

Up to now, there have been three public online linker databases: LINKER[3], IBIVU LinkerDataBase[2] and SynLinker[4]. However, apart from the IBIVU LinkerDataBase, these databases are inaccessible. The IBIVU LinkerDataBase was established in 2002 and has not been kept up to date; its analysis is based on the protein structures available at that time. Since then, protein structure analysis resources such as CATH and SCOP and three-dimensional structure prediction tools such as Rosetta and AlphaFold have been developed further, providing more accurate and reasonable analysis of protein secondary, super-secondary, and tertiary structure. Therefore, there is currently no online database of natural linkers that is based on the latest protein structures and updated in real time.

- SLinker Database

Based on data from the National Center for Biotechnology Information (NCBI) and the Conserved Domain Database (CDD), we obtained linkers by subtracting the conserved domain sequences from the full-length sequence of each protein. The sorted data were placed in the SLinker database, so that users can select the desired linker according to requirements such as length range, solvent accessibility, and terminal amino acid type.

We used the CD-Search tool in CDD to obtain the conserved domains of 4799 multi-domain protein sequences retrieved from NCBI, and then preprocessed the search results to filter out unqualified conserved domains and to find the maximum set of non-overlapping conserved domains in each protein sequence. If the number of conserved domains is 1, the full-length sequence is treated as a single conserved domain and there is no linker. If the number is greater than 1, the conserved domain sequences are subtracted from the full-length sequence, the first and last remaining segments are removed, and what is left is the set of linker sequences.
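The subtraction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SLinker pipeline itself: domains are given as hypothetical 1-based, inclusive `(start, end)` positions, "maximum of non-overlapping conserved domains" is read as the largest number of mutually non-overlapping domains (chosen greedily by earliest end position), and the "first and last sequences" are taken to be the terminal tails outside the outermost domains. The function names are ours.

```python
def select_non_overlapping(domains):
    """Pick the largest number of mutually non-overlapping domains.

    `domains` is a list of (start, end) residue positions, 1-based and
    inclusive. Greedy choice by earliest end position (interval scheduling).
    """
    chosen = []
    last_end = 0
    for start, end in sorted(domains, key=lambda d: d[1]):
        if start > last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


def extract_linkers(sequence, domains):
    """Return the segments lying between consecutive selected domains.

    The tails before the first domain and after the last domain are
    dropped (assumption: these are the "first and last sequences").
    """
    kept = select_non_overlapping(domains)
    if len(kept) < 2:
        return []  # a single domain means there is no linker
    linkers = []
    for (_, prev_end), (next_start, _) in zip(kept, kept[1:]):
        # residues prev_end+1 .. next_start-1 in 1-based numbering
        segment = sequence[prev_end:next_start - 1]
        if segment:
            linkers.append(segment)
    return linkers
```

For example, with three domains and one overlapping hit on a toy sequence, only the two inter-domain segments are returned and the terminal residues are discarded.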

We collected 7124 linkers, of which 3940 are 26 to 100 amino acid residues long. Because of limited computing power, we used trRosetta[5] to predict the three-dimensional structures of 500 linkers, and used FreeSASA[6] to calculate the solvent accessibility of these linkers.

To make the SLinker database more complete, we added empirical linkers that have been reported in the literature[7-12] and classified them as rigid or flexible linkers[13].

The final result is the SLinker database, in which each linker record contains the primary sequence, sequence length, three-dimensional structure, solvent accessibility, and related conserved domain information. Users can specify the length range, solvent accessibility, terminal amino acid type and other requirements to retrieve and display the desired linker sequences.
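A query of this kind amounts to a simple filter over linker records. The sketch below is a minimal illustration, assuming a hypothetical in-memory `Linker` record with only a sequence and a single solvent accessibility value; the field names, parameters, and units are ours and do not describe the actual SLinker schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linker:
    sequence: str                 # primary sequence, one-letter codes
    solvent_accessibility: float  # hypothetical: one value per linker


def filter_linkers(linkers, min_len=0, max_len=None, min_sasa=None,
                   n_term=None, c_term=None):
    """Return linkers matching length, accessibility and terminal residues."""
    result = []
    for linker in linkers:
        length = len(linker.sequence)
        if length < min_len:
            continue
        if max_len is not None and length > max_len:
            continue
        if min_sasa is not None and linker.solvent_accessibility < min_sasa:
            continue
        if n_term is not None and not linker.sequence.startswith(n_term):
            continue
        if c_term is not None and not linker.sequence.endswith(c_term):
            continue
        result.append(linker)
    return result
```

Each criterion is optional, so a user who only cares about the length range can leave the other arguments at their defaults.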

Our SLinker database

Structure Prediction Algorithm

In the pre-project phase, we collected many published structure prediction methods and procedures. We chose an ab initio method because the proteins we need to predict contain linker sequences for which homologs are hard to find, which makes it difficult to predict the structure of fusion proteins by homology modeling or threading. We looked at representative ab initio algorithms: GDFold[14], AlphaFold[15], and trRosetta[5]. After consulting the literature and the evaluations and descriptions of these algorithms in CASP, we finally chose trRosetta, which is open source and has better evaluated accuracy, as the structure prediction algorithm in Phoebus.

trRosetta is a structure prediction program that improves on the approach of AlphaFold. Whereas AlphaFold uses only the distances between residues for prediction[15], trRosetta uses both the distances and the orientations between residues[5], exploiting richer structural information, and is thus able to achieve more accurate predictions. In theory trRosetta can perform the full set of calculations on a desktop computer, but we chose to rent a server to run the structure prediction module because of time cost and other considerations.

trRosetta predicts the distance and orientation relationships between amino acid residues from a sequence alignment and converts them into smooth Rosetta restraints. These restraints are then used in energy minimization modeling, in which the energy of the folded structure is minimized by gradient descent to obtain the predicted protein structure[5].

Reliability assessment of structural prediction

We used the TM-score (template modeling score) to assess the reliability of the predictions. The TM-score is a metric that measures the structural similarity of two proteins.

The TM-score lies between 0 and 1, with 1 meaning that the two structures are in complete agreement. A score below 0.17 is what is expected for two unrelated structures chosen at random; a score greater than 0.5 means that the two structures have a similar fold.
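For readers unfamiliar with the metric, the standard TM-score sums 1 / (1 + (d_i / d0)^2) over aligned residue pairs and divides by the target length, where d_i is the distance between the i-th pair and d0 = 1.24 (L - 15)^(1/3) - 1.8 depends on the target length L. The Python sketch below evaluates that sum for one fixed superposition. It is a simplification: the full TM-score also maximizes over superpositions, which is not done here, and the lower bound of 0.5 on d0 for short chains is our assumption for keeping the formula defined.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms)."""
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    return max(d0, 0.5)  # assumption: clamp so short chains stay well defined


def tm_score_fixed(distances, target_length):
    """TM-score sum for ONE given superposition.

    `distances` holds the distance (angstroms) between each aligned residue
    pair; unaligned target residues simply contribute nothing.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

Because the sum is divided by the target length rather than the number of aligned pairs, a model that matches only half of the target perfectly scores 0.5 at most.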

We collected nine proteins from the PDB that were released after May 1, 2018, and scored the accuracy of their predicted structures. The results indicate that the predicted structures are reliable. We also found that trRosetta achieves better accuracy for shorter sequences and slightly lower, but still reliable, accuracy for longer sequences (TM-score > 0.6). These results agree with the expectation that the number of possible ways for a peptide chain to fold grows exponentially with the length of the amino acid sequence, making the interactions between residues harder to predict.

The result of fusion protein structure prediction in our software, Phoebus.

To learn more about our structural prediction algorithm reliability assessment, click here.

Activity Evaluator

After designing the fusion protein, we also wanted to explore whether the domains of the fusion protein could still perform their original functions, so we created an optional activity prediction module.

The most important factor in whether a protein functions properly is whether the structure of its active site has been significantly altered. We therefore started from the active site and evaluated the activity of the fusion protein by comparing the structure of the active site and its surroundings before and after fusion. To determine the active site, we first looked at databases that record the active sites of enzymes, such as M-CSA, and tried to build a database from these data in our software to look up the active sites of user-entered proteins. However, we gave up the idea of integrating an active site database because the active site residue numbers provided in M-CSA were difficult to map onto the protein sequences in the PDB.

We subsequently turned to the B-factor, a parameter recorded for each atom in PDB files. The B-factor describes the displacement of an atom from its average (mean) position (mean-square displacement). Higher flexibility results in larger displacements and, eventually, lower electron density. The core of the molecule usually has low B-factors because of the tight packing of the side chains, and enzyme active sites are usually located there. We therefore averaged the B-factors of all atoms in a residue to obtain the B-factor of the residue, sorted the residues by B-factor from smallest to largest, and took the 20% of residues with the lowest B-factors as the preliminary screening result for the active site. We also learned from the literature that the active sites of natural enzymes show a preference for certain residue types, with His, Cys, Asp, Glu, Arg, and Lys occurring most often[16], so residues of these types in the preliminary screening result were selected as the secondary screening result.

We present the secondary screening sites to the user by marking them in red in the full-length sequence of the protein. The user can select any residue in the full-length sequence with a single click, and multiple residues can be selected at one time. The user then enters the radius to be compared; if no radius is entered, a default radius is used. We then draw a sphere centered on the selected sites with the entered value as its radius, and use all residues inside the sphere as the input for the subsequent CAD-score calculation.
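The two screening steps and the sphere selection can be sketched as follows. This is a minimal illustration rather than the code running in Phoebus: it reads only `ATOM` records using the fixed PDB column layout, treats "inside the sphere" as "has at least one atom within the radius of any atom of a selected residue" (our assumption, since the exact criterion is not spelled out above), and all function names are hypothetical.

```python
from collections import defaultdict
from math import ceil, dist

PREFERRED_TYPES = {"HIS", "CYS", "ASP", "GLU", "ARG", "LYS"}


def parse_atoms(pdb_text):
    """Yield (chain, resseq, resname, (x, y, z), bfactor) per ATOM record."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield (line[21], int(line[22:26]), line[17:20].strip(),
                   xyz, float(line[60:66]))


def candidate_active_sites(pdb_text, fraction=0.2):
    """Lowest-B-factor residues, restricted to the preferred residue types."""
    per_residue = defaultdict(list)
    for chain, resseq, resname, _, bfactor in parse_atoms(pdb_text):
        per_residue[(chain, resseq, resname)].append(bfactor)
    mean_b = {res: sum(bs) / len(bs) for res, bs in per_residue.items()}
    ranked = sorted(mean_b, key=mean_b.get)               # smallest first
    preliminary = ranked[:ceil(fraction * len(ranked))]   # primary screening
    return [res for res in preliminary if res[2] in PREFERRED_TYPES]


def residues_in_sphere(pdb_text, centers, radius):
    """Residues with any atom within `radius` of an atom of a center residue.

    `centers` is a set of (chain, resseq) pairs chosen by the user.
    """
    atoms = list(parse_atoms(pdb_text))
    center_xyz = [xyz for chain, resseq, _, xyz, _ in atoms
                  if (chain, resseq) in centers]
    selected = set()
    for chain, resseq, resname, xyz, _ in atoms:
        if any(dist(xyz, c) <= radius for c in center_xyz):
            selected.add((chain, resseq, resname))
    return sorted(selected)
```

The output of `residues_in_sphere` is the residue set that would be handed to the CAD-score comparison for both the fused and the unfused structure.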

- CAD-score[17]

We used the CAD-score as an index of the structural similarity of the active site before and after protein fusion. The CAD-score assesses the difference between the two structures in the contact area between a central residue and its surrounding residues.

Two CAD-scores are output to the user: the CAD A-A score, which reflects the contact area difference between all atoms of the central residue and all atoms of the surrounding residues, and the CAD A-S score, which reflects the contact area difference between all atoms of the central residue and the side chains of the surrounding residues. A 3D image of the superimposed fusion and non-fusion proteins is also output, so that the user obtains a quantitative score and can also visually judge how much the active site of the fusion protein has been altered.
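To make the idea concrete, the published CAD-score[17] compares, for every residue-residue contact in the reference structure, the contact area T with the corresponding area M in the model, and reports 1 minus the sum of the bounded differences min(|T - M|, T) divided by the sum of T. The Python sketch below implements only that final formula. The contact areas are assumed to be given already as dictionaries keyed by residue pair (in the real method they come from a Voronoi tessellation of the atoms), and the function name is ours.

```python
def cad_score(reference_contacts, model_contacts):
    """CAD-score from precomputed contact areas.

    Both arguments map a residue pair, e.g. ("A10", "A42"), to a contact
    area. Contacts missing from the model count as area 0.
    """
    total_reference = sum(reference_contacts.values())
    if total_reference <= 0:
        raise ValueError("reference structure has no contact area")
    bounded_difference = 0.0
    for pair, ref_area in reference_contacts.items():
        model_area = model_contacts.get(pair, 0.0)
        # the difference for one contact is capped at the reference area
        bounded_difference += min(abs(ref_area - model_area), ref_area)
    return 1.0 - bounded_difference / total_reference
```

A score of 1 means the contacts around the selected site are unchanged after fusion, and a score of 0 means none of the original contact area is reproduced.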

- Cascade Reaction Rate Control Module

The function of this module is just what its name says: it helps users control the rate of a cascade reaction with light. The final goal is that, once users have determined their target reactions, our software helps them rebuild the enzymes catalyzing these reactions into opto-controllable elements and provides them with the [light intensity] – [reaction rate] curve. We hope to provide researchers with better solutions for quantitatively and dynamically regulating metabolic pathways.

To achieve this goal, we first ask users to upload the 2 (or 3) enzymes that catalyze the target 2-step (or 3-step) reaction to our opto-controllable element designer. We ask users to select opto-switch proteins that dimerize upon light exposure (for example, CRY2), so that the designed opto-controllable elements heterodimerize under light. The 2 (or 3) opto-controllable elements are then uploaded to this module in PDB format. The module invokes the activity evaluation algorithm, which automatically calculates the distance between activity-related residues once the user has specified the active sites. Before running the model, users also need to enter some other parameters (the kinetic parameters of the enzymes involved and the concentrations of enzyme and substrate) and can modify the remaining default parameters. When the model is run, the [light intensity] – [enzyme dimerization percentage] function is automatically nested with a three-step reaction rate model to obtain the final [light intensity] – [reaction rate] function. The user can enter a desired reaction rate, and we return the light intensity that should be used. Note that this intensity is expressed as an equivalent optical energy density, which can be regulated by adjusting the light intensity or the light-dark oscillation period. To learn more about our model, click here.

How did we come up with this idea?

At the very first stage of designing our project, we decided to build an opto-controllable elements designer that we hoped would be a tool with infinite possibilities, because in theory users can control the activity of any protein with it. But we soon realized that users may need more specific supplementary tools when working on a specific problem. For example, users who want to control a cascade reaction rate through light intensity need to design hetero-oligomerizing opto-controllable elements with our designer, and also need an algorithm to calculate the cascade reaction rate. Tools like this only work in specific situations, but they can make our platform much more useful. Since users have many different requirements, it is impossible for us to build all of these supplementary tools. We therefore encourage users to build the open-source tools they find useful themselves and combine them with our software.

In May, our partner SYSU-China reminded us that protein proximity can speed up cascade reactions, which is why there are so many enzyme complexes in nature. We therefore seriously considered the possibility that users may want to build opto-controllable elements that hetero-oligomerize in order to control reaction rates accurately with light, and decided to build this module as an example supplementary tool. After literature research, we confirmed that enzyme proximity can indeed accelerate cascade reactions, and we found a model describing the kinetics of a two-step reaction system with enzyme proximity.

How does this module work?

The relationship between light intensity and reaction rate can be broken down into two parts: the relationship between light intensity and the amount of heterodimerization of the opto-controllable elements, and the relationship between the amount of heterodimerization and the change in reaction rate.

To calculate the relationship between light intensity and the amount of heterodimerization of the opto-controllable elements, we first assume that the kinetic parameters of heterodimerization are the same for the opto-controllable elements as for the opto-switch protein alone, i.e. that linking the target protein to the opto-switch protein does not affect its dimerization. Such a simplification is necessary because it is impossible to calculate the dimerization behavior of every kind of photoreceptor. After literature research, we also confirmed that light intensity determines the proportion, rather than the absolute number, of dimerized opto-switch proteins. Because users may use different light-dimerizing opto-switch proteins, we require them to input the light intensity versus dimerization percentage curve of the opto-switch proteins they use. This curve can come from the user's own experiments, from previous research, or from work shared by others within the Phoebus Community. We have also prepared an example curve to show how to use our software, but please note that it may be very different from the behavior of your opto-switch protein!

Next, we deal with the relationship between the amount of heterodimerization of the opto-controllable elements and the rate increase of the final reaction. As mentioned above, light intensity determines the proportion of opto-switch proteins that dimerize. On this basis, we can derive the overall change in reaction velocity within a cell from two quantities: the proportion of opto-switch proteins that dimerize, and the velocity change per unit of dimerized opto-controllable elements (post-aggregation velocity / original velocity × 100%).
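One simple way to picture how the two parts are nested is shown below. This is a sketch under explicit assumptions, not our full model: the user-supplied curve is interpolated linearly between measured points, the dimerized and non-dimerized populations are assumed to contribute to the overall rate in proportion to their fractions, the curve is assumed to be strictly increasing so that it can be inverted, and `fold_change` (post-aggregation velocity divided by original velocity, not equal to 1) is taken as a given number. All names are hypothetical.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation, clamped at both ends (xs ascending)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def relative_rate(intensity, intensities, dimer_fractions, fold_change):
    """Overall rate relative to the dark state (assumed linear mixing)."""
    fraction = interpolate(intensities, dimer_fractions, intensity)
    return (1.0 - fraction) + fraction * fold_change


def intensity_for_rate(target, intensities, dimer_fractions, fold_change):
    """Invert the nested function: light intensity giving `target` rate."""
    fraction = (target - 1.0) / (fold_change - 1.0)
    if not dimer_fractions[0] <= fraction <= dimer_fractions[-1]:
        raise ValueError("target rate cannot be reached with this curve")
    # swapping the axes inverts a strictly increasing curve
    return interpolate(dimer_fractions, intensities, fraction)
```

With a curve measured at intensities 0, 10 and 20 giving dimerized fractions 0, 0.5 and 1, and a threefold speed-up per dimerized unit, asking for twice the dark-state rate returns an intensity of 10.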

After completing the problem analysis and decomposing the model, we started to build its main part: the proportional change in reaction velocity before and after each unit of opto-controllable element dimerization. As mentioned earlier, several articles have studied how the spatial proximity of enzymes affects the reaction rate and how the diffusion of intermediates is affected in a crowded intracellular environment.

Because the concentration of an intermediate is relatively high in the neighborhood of the enzyme that produces it, we need to calculate the distribution of the intermediate in order to evaluate the acceleration brought about by the proximity that opto-switch protein oligomerization induces.

Schematic diagram of the intermediate reaction. The density of the orange color reflects the concentration of the intermediate.

After many iterations of fumbling, designing, building, failing, learning, and designing again (see Engineering for more details), we decided not to build a model for calculating the diffusion coefficient, but to use a default value (5 × 10^7 nm^2·s^-1) that users can modify. We further examined the similarities and differences between the scenario described in the original enzyme proximity article and the one we faced, and constructed our own model.

Improvements we made to the original model: the original model was only applicable to two-step reaction systems, whereas we wanted to extend it to scenarios in which three enzymes are involved and only dimerization (both homo- and heterodimerization) can occur between any of the enzymes. We did not find such a model in previous papers; indeed, we could not find even simple models describing enzyme proximity in three-step reactions. We therefore faced many difficulties. We made some new, reasonable assumptions as the basis of our model and started building it (see Model for detailed information).

Such a special three-step reaction system is very difficult to solve. In fact, we were unable to obtain a meaningful analytical solution after running programs continuously for ten days and nights using Matlab, Mathematica, Python, etc. After nondimensionalization, we were finally able to solve it numerically with our program (see Engineering for more details).

Of the nine types of input parameters required by the model, the kinetic parameters of the enzymes and the concentrations of substrate and enzyme must be entered by the user; the diffusion coefficients describing the diffusion of intermediates 1 and 2 are provided by our software as default values and can be modified by the user; and the inner surface area of the active sites of the two enzymes and the distance between them are obtained automatically by our model.

Next, we wrote down the reaction equations corresponding to the model based on our assumptions. Combining them with the spatial distribution of intermediate concentrations around the corresponding enzyme and the surface area of the enzyme active site, we obtain a set of equations for the concentration of each substance in the reaction system. By solving this set of equations numerically, we can calculate the increase in the total rate of the reaction.
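As an illustration of what "solving the set of equations numerically" means in practice, the Python sketch below integrates a generic two-step cascade (substrate to intermediate to product) with a fourth-order Runge-Kutta scheme. The rate equations are plain Michaelis-Menten kinetics chosen only for the example: they do not contain the diffusion, proximity, or surface-area terms of our model, the solver is not the one used in Phoebus, and every name and parameter value is hypothetical.

```python
def cascade_rates(state, params):
    """Time derivatives of (substrate, intermediate, product)."""
    substrate, intermediate, _ = state
    v1 = params["vmax1"] * substrate / (params["km1"] + substrate)
    v2 = params["vmax2"] * intermediate / (params["km2"] + intermediate)
    return (-v1, v1 - v2, v2)


def rk4_step(rates, state, dt, params):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    k1 = rates(state, params)
    k2 = rates(tuple(x + 0.5 * dt * k for x, k in zip(state, k1)), params)
    k3 = rates(tuple(x + 0.5 * dt * k for x, k in zip(state, k2)), params)
    k4 = rates(tuple(x + dt * k for x, k in zip(state, k3)), params)
    return tuple(
        x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(initial_state, params, dt, steps):
    """Integrate the cascade for `steps` steps and return the final state."""
    state = initial_state
    for _ in range(steps):
        state = rk4_step(cascade_rates, state, dt, params)
    return state
```

Running the same simulation with and without a proximity-dependent speed-up of the second step, and comparing the product formed, is one way the increase in total rate could be read off from such a numerical solution.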

- Transcription Control Module

In this module, Phoebus helps design fusion proteins of a transcription factor and an opto-controllable oligomerizing protein (for example, CRY2). In the absence of light, the transcription factors are expressed normally and are functional; in the presence of 450 nm light, oligomerization of CRY2 leads to aggregation of the transcription factors, which are inactivated by the site-blocking effect. This enables the concentration of active transcription factors in the cell to be controlled by light. The user inputs a transcription factor or gene, and Phoebus outputs the fusion protein sequence. The transcription factor databases we plan to build on include FootprintDB, Cistrome DB, TB database (TBDB), hTFtarget, TRANSFAC, etc.

Phoebus Community

Phoebus already makes a great effort to provide users with a user-friendly interface and a set of accessible design and product-evaluation strategies. Beyond these core functions, and in order to fully achieve our project goal, "To make Phoebus a platform of infinite possibilities", we designed the Phoebus Community to connect directly with our target users.

Inside our community, users can register as either "researchers" or "other jobs". After filling in basic information and validating their identities, users are free to upload ideas and suggestions that make Phoebus better at addressing real-world challenges, for example by developing more functional modules and plug-ins. Users can also comment and communicate with each other. We look forward to making it a real community in which users exchange thoughts and develop collaborations.

Most importantly, users' privacy is well protected. Users identified as researchers can choose whether the ideas and experimental results they upload are visible to all users or accessible to other researchers only. With this function, we hope to protect individuals' intellectual property.

Finally, there is a discussion board for all kinds of users. Anyone can post problems they encounter in designing opto-controllable elements or in their experiments. We will check these discussions regularly and try our best to close the feedback-design loop to optimize Phoebus' functional design. A notice board is also available to all, where we will post changes and modifications, as well as guidelines to help users make full use of Phoebus.

All in all, the Phoebus Community gives users opportunities to communicate and collaborate with other scientists and, at the same time, helps improve our software's functional design.

References

(1) Kolar, K., Knobloch, C., Stork, H., Znidaric, M. & Weber, W. OptoBase: A Web Platform for Molecular Optogenetics. ACS Synth Biol 7, 1825-1828, doi:10.1021/acssynbio.8b00120 (2018).

(2) George, R. A. & Heringa, J. An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng 15, 871-879, doi:10.1093/protein/15.11.871 (2002).

(3) Crasto, C. J. & Feng, J. A. LINKER: a program to generate linker sequences for fusion proteins. Protein Eng 13, 309-312, doi:10.1093/protein/13.5.309 (2000).

(4) Liu, C., Chin, J. X. & Lee, D. Y. SynLinker: an integrated system for designing linkers and synthetic fusion proteins. Bioinformatics 31, 3700-3702, doi:10.1093/bioinformatics/btv447 (2015).

(5) Yang, J. Y. et al. Improved protein structure prediction using predicted interresidue orientations. P Natl Acad Sci USA 117, 1496-1503, doi:10.1073/pnas.1914677117 (2020).

(6) Mitternacht, S. FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Res 5, 189, doi:10.12688/f1000research.7931.1 (2016).

(7) Huston, J. S. et al. Protein engineering of antibody binding sites: recovery of specific activity in an anti-digoxin single-chain Fv analogue produced in Escherichia coli. Proc Natl Acad Sci U S A 85, 5879-5883, doi:10.1073/pnas.85.16.5879 (1988).

(8) Bai, Y. & Shen, W. C. Improving the oral efficacy of recombinant granulocyte colony-stimulating factor and transferrin fusion protein by spacer optimization. Pharm Res 23, 2116-2121, doi:10.1007/s11095-006-9059-5 (2006).

(9) Sabourin, M., Tuzon, C. T., Fisher, T. S. & Zakian, V. A. A flexible protein linker improves the function of epitope-tagged proteins in Saccharomyces cerevisiae. Yeast 24, 39-45, doi:10.1002/yea.1431 (2007).

(10) Lu, P. & Feng, M. G. Bifunctional enhancement of a beta-glucanase-xylanase fusion enzyme by optimization of peptide linkers. Appl Microbiol Biotechnol 79, 579-587, doi:10.1007/s00253-008-1468-4 (2008).

(11) Bergeron, L. M., Gomez, L., Whitehead, T. A. & Clark, D. S. Self-Renaturing Enzymes: Design of an Enzyme-Chaperone Chimera as a New Approach to Enzyme Stabilization. Biotechnol Bioeng 102, 1316-1322, doi:10.1002/bit.22254 (2009).

(12) de Bold, M. K. et al. Characterization of a long-acting recombinant human serum albumin-atrial natriuretic factor (ANF) expressed in Pichia pastoris. Regul Pept 175, 7-10, doi:10.1016/j.regpep.2012.01.005 (2012).

(13) Chen, X., Zaro, J. L. & Shen, W. C. Fusion protein linkers: property, design and functionality. Adv Drug Deliv Rev 65, 1357-1369, doi:10.1016/j.addr.2012.09.039 (2013).

(14) Mao, W. Z., Ding, W. Z., Xing, Y. G. & Gong, H. P. AmoebaContact and GDFold as a pipeline for rapid de novo protein structure prediction. Nat Mach Intell 2, 25-33 (2020).

(15) Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706-710, doi:10.1038/s41586-019-1923-7 (2020).

(16) Ribeiro, A. J. M. et al. Mechanism and Catalytic Site Atlas (M-CSA): a database of enzyme reaction mechanisms and active sites. Nucleic Acids Res 46, D618-D623, doi:10.1093/nar/gkx1012 (2018).

(17) Olechnovic, K., Kulberkyte, E. & Venclovas, C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 81, 149-162, doi:10.1002/prot.24172 (2013).