Team:SYSU-Software/Engineering


Engineering Success


Idea: How did we come up with the idea of Phoebus?

In many experiments, protein behavior must be precisely controlled in order to explore protein function. We therefore asked whether this could be achieved by linking the structural domain of a light-controlled protein to the target protein through a linker. To test the idea, we designed a questionnaire and posted it in an exchange group for iGEMers in China to collect opinions, assess the feasibility of the project, and refine its details. More than half of the respondents considered our idea feasible. Regarding linkers, most respondents use linkers worked out in their own labs, so we hope to offer a more personalized and intelligent linker design method in our software.

Figure 1 | Questionnaire results

Research: How do we achieve this idea?

There are two broad approaches to controlling protein behavior: controlling the amount of protein produced by regulating expression of the corresponding gene, or controlling the protein directly by acting on its active site through physical or chemical methods. The former is slow and complex; the latter is simple and efficient, but may cause irreversible damage to other cellular functions. While reviewing the literature, we came across the photoswitch, a tool whose structure changes in response to light.

First, we defined the experimental components used in non-neuronal optogenetics as “opto-controllable elements”. Then we created a complete design workflow for opto-controllable elements with standardized BioBricks, to make non-neuronal optogenetics easier, faster, and more reliable for biologists to use. Finally, Phoebus was born.

Design: What modules do we need to achieve the target?

We decided that Phoebus needs four basic modules to achieve our purpose: an opto-switches database, linker design, fusion protein structure prediction, and activity prediction.

- opto-switches database

Photosensitive proteins are a class of proteins that change their structure when stimulated by light of a specific wavelength. They are found in a variety of organisms and enable those organisms to respond to changes in environmental light. Some photosensitive proteins heterodimerize with another protein after light stimulation at a specific wavelength, others homodimerize or multimerize with proteins of the same kind, and still others undergo a conformational change in the light that exposes a previously masked helical structure.

Based on these three expected functions, we searched published articles for available photosensitive proteins and their variants, and in the process we also found some platforms that integrate information on photosensitive proteins. According to the wavelength of sensed light and the function, our project pre-integrated three types of photosensitive proteins: CRY-like proteins, LOV-like proteins, and others.

Figure 2 | Wavelength of sensed light and its function

We collected the structures of dozens of photosensitive proteins together with their sequence information, size, excitation wavelength, mutation sites, mutation functions, and references. The data are organized into a list, which lets users intuitively select the photosensitive protein structures they need and makes it convenient to interface with downstream functions.

designing: Before we started building the software, we learned that many opto-controllable proteins can undergo conformational changes, homopolymerization, or heteropolymerization. We expected to collect as much data as possible to enrich our list.

building: Based on the goals of the designing stage, we manually built a large database of opto-controllable elements, reviewing a large number of publications and databases to obtain information on many of these proteins.

testing: After collecting this information, we tested it against practical application ideas and found two problems. First, it was difficult to complete the subsequent design with the two types of opto-controllable proteins other than the homopolymerizing ones. Second, relying on natural opto-controllable proteins alone would leave our list sparse and narrow in application.

learning: To solve the problems found in the testing stage, we made two decisions. First, we would continue to collect opto-controllable proteins that can undergo homopolymerization while abandoning the other two types, which covers the majority of experimental designs. Second, we would expand the search beyond naturally occurring opto-controllable proteins to artificially modified ones, and collect detailed information on their mutation sites, mutation functions, and so on.

- linker

In the beginning, we wanted to know how to design a fusion protein, and this led us to linkers. After reading the relevant papers and querying databases such as IBIVU LinkerDataBase and Synlinker (inaccessible), we adopted the design idea of Synlinker: to build a natural linker database, we take the full-length sequence of a multi-domain protein and subtract the sequences of its conserved domains, which leaves the primary structure of the linkers.

Using NCBI and CDD, we obtained a large amount of linker data, which contained repeated and overlapping sequences. We therefore processed the retrieved data, screened out unqualified conserved domains, and searched for the largest set of non-overlapping conserved domains in each protein sequence. If a protein has only one conserved domain, no linker exists and the full-length sequence is left as it is. If it has more than one, we subtract the conserved domain sequences from the full-length sequence and remove the first and last remaining segments; the segments that are left are the natural linkers.

To make the database more complete, we added empirical linkers that have been previously reported, classified as rigid linkers and flexible linkers.

To understand linkers more deeply, we used trRosetta to predict the three-dimensional structure of some linkers and carried out a preliminary analysis of the linkers' primary structure data. We found that linkers within a certain length range show a specific amino acid preference, which may offer a way to identify linkers.
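The extraction rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline: the function names, the 0-based half-open domain coordinates, and the reading of "first and last sequences" as the N- and C-terminal flanks are all assumptions, and the greedy earliest-end selection is one standard way to obtain a largest set of non-overlapping intervals.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 0-based, end exclusive.
Domain = Tuple[int, int]


def max_non_overlapping(domains: List[Domain]) -> List[Domain]:
    """Pick a largest set of non-overlapping domains (greedy by earliest end)."""
    chosen: List[Domain] = []
    last_end = 0
    for start, end in sorted(domains, key=lambda d: d[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


def extract_linkers(sequence: str, domains: List[Domain]) -> List[str]:
    """Return the segments lying between consecutive selected domains.

    With fewer than two selected domains no linker is reported. The
    N- and C-terminal flanks are never returned, because only the gaps
    between consecutive domains are sliced out.
    """
    chosen = max_non_overlapping(domains)
    if len(chosen) < 2:
        return []
    linkers: List[str] = []
    for (_, prev_end), (next_start, _) in zip(chosen, chosen[1:]):
        segment = sequence[prev_end:next_start]
        if segment:  # directly adjacent domains leave no linker
            linkers.append(segment)
    return linkers


# Toy sequence: two flanking residues, domain, linker, domain, two flanking residues.
toy_sequence = "MM" + "AAAAA" + "GGSGG" + "CCCC" + "TT"
toy_domains = [(2, 7), (5, 14), (12, 16)]  # (5, 14) overlaps both others
toy_linkers = extract_linkers(toy_sequence, toy_domains)
```

In the toy input the overlapping annotation `(5, 14)` is discarded, so the two remaining domains are separated by a single linker segment.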

- structure prediction

We use trRosetta for structure prediction. Compared with other open-source structure prediction algorithms, trRosetta predicts structure from two types of information, inter-residue distances and angles, described by six parameters, which gives it higher accuracy. We apply the algorithm to fusion proteins to determine from the predicted structure whether the photoswitch domain and the target protein remain two relatively independent structural domains. If they are spatially separated from each other, it can tentatively be assumed that each retains its own activity.

testing: We evaluated the accuracy of the structure prediction algorithm on proteins deposited in the PDB after May 1, 2018, using TM-score as the evaluation metric. The TM-scores of the nine proteins tested are shown in Figure 3 and Figure 4. All TM-scores are larger than 0.5, which generally indicates that a predicted structure has the same fold as the experimental one, so we consider the predictions reliable.

Figure 3 | TM-score of the predicted protein

Figure 4 | Histogram of the TM-score of the predicted protein
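For readers unfamiliar with the metric, the TM-score used in the test above can be sketched as follows. This is a simplified illustration rather than the evaluation code the team ran: it scores one fixed superposition from a list of per-residue distances, whereas the real TM-score is the maximum over all superpositions. The function names are hypothetical, and the 0.5 Å floor on the distance scale for very short chains is an assumption.

```python
from typing import Sequence


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8 (angstroms)."""
    d0 = 1.24 * max(target_length - 15, 0) ** (1.0 / 3.0) - 1.8
    # Assumption: floor d0 at 0.5 so that short chains do not give a
    # non-positive scale.
    return max(d0, 0.5)


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score contribution of one superposition.

    distances: distance in angstroms between each aligned residue pair.
    target_length: number of residues in the reference structure; residues
    without an aligned partner contribute nothing to the sum.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


perfect = tm_score([0.0] * 100, 100)           # every residue superimposed exactly
half_covered = tm_score([0.0] * 50, 100)       # only half the residues aligned
at_scale = tm_score([tm_d0(100)] * 100, 100)   # every pair exactly d0 apart
```

Because the sum is normalized by the reference length, both poor superposition and incomplete coverage pull the score down, which is why a threshold such as 0.5 can be compared across proteins of different sizes.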

- activity prediction

Activity prediction is an optional module that gives the user further predictions of fusion protein activity. We use the CAD score as an indicator of the structural similarity of the active site before and after protein fusion. Because calculating the CAD score requires the residue numbers of the active site and its surroundings as input, we first need to find the protein's active site. We started with the existing M-CSA database and tried to use its active site information directly for subsequent calculations.

testing: After integrating the M-CSA database, we found that its residue numbers did not correspond to the residue numbers in the PDB, and that few proteins had active site information, so we finally abandoned this database.

improving: We decided that users can find active sites in published articles, or select active sites and the radius of the active center based on their own experience, which makes the prediction of active site structural similarity better suited to their needs. To assist them, we use the B-factor: the smaller a residue's B-factor, the harder it is for its structure and spatial position to change, and the more likely the residue is to serve as an active site. We also identified the residue preference of active sites statistically. Sites that pass both the B-factor screen and the statistical rules are recommended to users as more likely active sites.
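The two-filter recommendation described above might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the function name, the 50% B-factor cutoff, the per-residue B-factor dictionary, and in particular the set of preferred residue types, which is a placeholder and not the statistical preference the team derived.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical key: (residue number, three-letter residue name).
Residue = Tuple[int, str]


def recommend_sites(
    b_factors: Dict[Residue, float],
    preferred_types: Set[str],
    fraction: float = 0.2,
) -> List[Residue]:
    """Return residues that pass both filters, ordered by rising B-factor.

    Filter 1: the residue is among the lowest `fraction` of B-factors.
    Filter 2: the residue type is in `preferred_types`.
    """
    if not b_factors:
        return []
    ranked = sorted(b_factors, key=lambda residue: b_factors[residue])
    n_keep = max(1, int(len(ranked) * fraction))
    rigid = ranked[:n_keep]
    return [residue for residue in rigid if residue[1] in preferred_types]


# Toy per-residue B-factors (made-up numbers, not measured data).
toy_b_factors = {
    (10, "HIS"): 8.0,
    (11, "GLY"): 9.0,
    (12, "ASP"): 30.0,
    (13, "SER"): 35.0,
    (14, "CYS"): 40.0,
}
placeholder_types = {"HIS", "ASP", "CYS"}  # placeholder, not from the source
candidates = recommend_sites(toy_b_factors, placeholder_types, fraction=0.5)
```

Here ASP 12 and CYS 14 have a preferred type but are too flexible, and GLY 11 is rigid but not a preferred type, so only HIS 10 is recommended. The user would still make the final choice, as the text describes.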

- application

designing: While extending a two-step cascade reaction model to a three-step system, we found that the complexity increased greatly because the number of enzyme complex types grows exponentially. After discussing and studying the biological mechanisms, we found that CRY2 only forms dimers under light, which greatly reduced the number of enzyme complexes in the model. We therefore refined and optimized the model so that only dimerization is allowed in our reaction system.

After determining the framework of the main model, we still needed to build some sub-models. First, to simulate the cellular environment, we took the diffusion coefficient of the solute in a complex environment into consideration. At first we planned to treat the influence of macromolecules and the cytoplasmic environment on solute diffusion theoretically. Through repeated discussion and reading of the relevant papers, we found that the diffusion process is extremely complicated and that several problems must be considered at once: collisions between diffusing substances, the shape and size of the diffusing substance, molecular dynamics, the binding and separation of substances and macromolecules in the environment, the distribution of other substances in the cytoplasm, and the obstruction of molecules by the viscous cytoplasmic matrix. This was beyond our ability at that point.

We then hoped to introduce the Monte Carlo method from stochastic simulation to reflect micro-level relations indirectly through macro-level statistics and variables. After further analysis and discussion, we found that the rules for the Monte Carlo method were too difficult to set up because they involve a lot of biological mechanism, so we tried to simplify the diffusion sub-model again. However, statistical analysis ran into a dilemma as well, and many open questions would have required a large amount of time and energy. Since the primary issue is the relation between light intensity and reaction rate, there was no need to build a large model of the diffusion coefficient. We therefore use existing coefficients reported in other articles as the default diffusion coefficient.

building: Based on the model, we obtained a set of nonlinear ODEs (converted to algebraic equations). Although we tried various mathematical software (MATLAB, Mathematica, Python, etc.) for a long time, we failed to get an explicit symbolic solution because of the complexity of the equations. The main reason is that the reaction involves too many variables and parameters. We also had to deal with cases where some variables have several solutions. We realized that our model was too complex to yield a mathematical result, so we decided to simplify it. Briefly, we separated the three-step reaction system into two parts, decomposing the equations into two corresponding sections. Within each section, we first handled free enzyme and enzyme in complex separately, and then combined the results according to their proportions in the system. Although this still does not give an explicit symbolic solution, our program can compute the reaction rate as a function of the enzyme gathering proportion from numerical parameter inputs.

In the later stage of modeling, when we began to test the model, we found that more parameters were needed, including values calculated from the PDB data of the enzymes. To calculate the active site, we first decided to compute directly on the multi-molecule structure of the enzyme complex. However, it is difficult to calculate the rotation about the z-axis between two enzyme structures in a single three-dimensional coordinate system. So we divided the enzyme complex structure into multiple parts, established a rectangular coordinate system for each part, and then transformed the data through the relationships between the coordinate systems. Secondly, we established a sub-model that describes the active site as an equivalent sphere, which is convenient for subsequent calculations. Because enzyme complexes differ from one another, we believe that differences in the size and relative position of active sites within the enzymes affect the distance between active sites. Through enzyme matching, we provide users with an input interface, as well as other methods and tools for obtaining PDB data, to improve the feasibility of the program.

testing: Finally, after combining the parameters with the final model, we started to test it. We found that data type mismatches may occur when calculating the active site, that user preferences need to be supported, and that the active sites of different enzymes need to be calculated repeatedly. We also had to consider actual multi-enzyme mixing: the ratio of the enzyme complexes may change, and the user might encounter complexes of three enzymes, so we need to decide how to modify the combination and processing for this new scenario. In addition, we need to display the resulting plot and the output of the function, and to output the values and calculations that users require.
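The simplification in the building step, where free enzyme and enzyme in complex are handled separately and then combined by their proportions, can be illustrated with a short sketch. It assumes the combination is a proportion-weighted average of two precomputed rates; the function names and all numerical values are hypothetical and stand in for the outputs of the team's sub-models.

```python
from typing import List, Tuple


def combined_rate(p_complex: float, rate_free: float, rate_complex: float) -> float:
    """Proportion-weighted rate for a mix of free and complexed enzyme.

    p_complex: fraction of enzyme gathered into complexes (0 to 1).
    rate_free, rate_complex: rates computed separately for each population.
    """
    if not 0.0 <= p_complex <= 1.0:
        raise ValueError("p_complex must lie between 0 and 1")
    return (1.0 - p_complex) * rate_free + p_complex * rate_complex


def rate_curve(
    rate_free: float, rate_complex: float, steps: int = 10
) -> List[Tuple[float, float]]:
    """Tabulate the combined rate over evenly spaced gathering proportions."""
    return [
        (i / steps, combined_rate(i / steps, rate_free, rate_complex))
        for i in range(steps + 1)
    ]


# Hypothetical rates in arbitrary units, chosen only to show the shape.
curve = rate_curve(rate_free=1.0, rate_complex=3.0, steps=4)
```

Tabulating the rate numerically in this way reflects the outcome described in the text: without an explicit symbolic solution, the rate is obtained as a function of gathering proportion for given numerical parameters.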

Verification

We have validated the usability of our system through the following two in-cell experiments.

Experiment 1: Opto-controllable enzyme aggregation experiment

We selected two key enzymes (aroF, pheA) in the phenylalanine production pathway of E. coli as the enzymes to design. Each enzyme was linked to the CRY2 protein via a linker designed by our software Phoebus, making it opto-controllable. CRY2 is a protein that oligomerizes under light of 450 nm wavelength. The two opto-controllable enzymes were introduced into E. coli separately. After stable expression, we extracted and mixed them and observed the rate of product formation. Oligomerization of CRY2 under 450 nm light should increase the local enzyme concentration and thus the reaction rate, so the yield of the end product over a given period should increase.

The protocol was initially designed to introduce both enzymes into the same E. coli cell at the same time, but this was too complicated for us, so we decided to extract the proteins for in vitro experiments.

As no one on the team had experience constructing expression vectors, we ran into many steps in designing the experiment that we did not know how to carry out (e.g., selecting restriction enzyme sites and designing primers). We solved these problems by consulting our advisors and professors and by collaborating with the experimental team.

Experiment 2: Light control of protein aggregation reflected by GFP

We wanted to construct an opto-controllable GFP by linking the CRY2 protein to GFP. The construct was introduced into mammalian cells, which were incubated under 450 nm light or in the dark. After a period of time, we used fluorescence microscopy to observe the distribution of green fluorescence in the cells, which reflects the aggregation of CRY2.

Our original plan was to carry out this experiment in E. coli, and we did so. However, we could not get good results under the fluorescence microscope: the green fluorescence aggregation in E. coli cells looked similar with and without light. This may be because E. coli cells are small and the resolution of the fluorescence microscope was insufficient. We also found that the background noise was high, so the results could not be trusted. After consulting our advisor, we redesigned the protocol to carry out the experiment in mammalian cells. We also collaborated with the experimental team during this process.

Maintenance

In the future, we will continue to collect new photoswitch information to improve the photoswitch database and give users more design options. If problems or room for improvement are found in the database information or the algorithms in the software, we will debug and update them promptly. We will also continue to promote our software to more researchers. We hope that our software will bring convenience and value to synthetic biology research and production.