
Software-P2N

What is P2N?

Codon preference means that for amino acids with several codons, one or a few are preferred and are used disproportionately in a particular species. Different species have different codon preferences, and without proper optimization, translation efficiency can be greatly reduced[1]. In fact, many of the proteins in our projects come not from E. coli but from mammals. We therefore urgently needed a codon optimization tool.
So we designed P2N to assist our projects and to serve other potential users. P2N is software written mainly in Python, with the help of web crawlers. It converts the user's amino acid or DNA sequence input into an optimized DNA sequence according to codon preference. The software obtains codon preference information for a wide variety of species by querying a public database, the Codon Usage Database[2], and then uses our own algorithm to give the user the highest-confidence result.
You can download the source code and the software here.

Why Develop P2N?

In our Metabolic balancing project, we are trying to use our engineered lactic acid bacteria (LAB) to express BSH and so regulate metabolic homeostasis. However, the therapeutic LAB need to stay in the intestinal tract for a period of time, so we plan to have the LAB express LAP to adhere to the intestinal tract (see Bacterial adhesion modeling for details).

Fig.1 LAB expressing LAP (Listeria adhesion protein) adhere better to small intestine cells, which helps the engineered LAB remain in the intestine.

However, we ran into great difficulties in the molecular cloning of LAP. Over three months, neither we nor the gene synthesis company managed to clone the LAP gene into pMG36e, the well-known L. lactis vector. This held up our project considerably, so we systematically analyzed the failures. Taking a closer look at the company's codon-optimized LAP DNA sequence, we realized that the codon optimization performed by the biotech company was not entirely reasonable.
The LAP DNA sequence the company gave us contains a large number of repetitive sequences (the sequence AAAAA appears 21 times in the LAP gene), and there are even more than a dozen repeated restriction enzyme sites. These are known to cause problems in PCR and cloning, such as slippage mispairing and incorrect ligation. More seriously, a poly-A structure appears at one end of the LAP gene, which makes molecular cloning extremely difficult for us (see Fig.2).

Fig.2 The poly-A structure at one end of the LAP gene, which makes molecular cloning extremely difficult.
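
The kind of check described above (counting short repeats such as AAAAA, counting restriction sites, and finding the longest single-base run) can be sketched in a few lines of Python. This is a minimal illustration and not P2N's own code: the function names and the toy sequence are assumptions, and the toy sequence is not the real LAP gene.

```python
import re


def count_overlapping(seq: str, motif: str) -> int:
    """Count occurrences of motif in seq, including overlapping ones."""
    # A zero-width lookahead lets the regex engine report overlapping hits.
    return len(re.findall(f"(?={re.escape(motif)})", seq.upper()))


def longest_homopolymer(seq: str) -> tuple[str, int]:
    """Return (base, length) of the longest single-base run in seq."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    best = max(runs, key=lambda m: len(m.group()))
    return best.group(1), len(best.group())


# Hypothetical toy sequence: a short A-run, two BamHI sites, and a poly-A tail.
toy = "ATG" + "A" * 6 + "GGATCC" + "TTT" + "GGATCC" + "A" * 10

print("AAAAA occurrences:", count_overlapping(toy, "AAAAA"))
print("GGATCC occurrences:", count_overlapping(toy, "GGATCC"))
print("longest single-base run:", longest_homopolymer(toy))
```

Running such a check on a synthesized sequence before ordering it would flag the repeats and duplicated sites that caused trouble here.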

Although these issues were extremely frustrating, they also inspired us. Why not develop a better software tool? To find a more scientific and reasonable way to design sequences, we developed our own codon preference algorithm.
For our project, this tool assists in the codon optimization of the therapeutic protein part. For other users, it can help them quickly find the best DNA sequence for a particular species. The software is therefore general-purpose and convenient.

Why Choose P2N?

Firstly, it can effectively avoid repetitive sequences. For example, when designing the DNA for a repetitive linker such as 'SSGSSGSSGSSG', our software avoids encoding it with a repetitive DNA sequence.

Fig.3 P2N can avoid repetitive sequences.

Table 1 Ten rounds of codon optimization for the repetitive sequence "SSGSSGSSGSSG".

The 1st time TCATCCGGAAGCAGTGGCAGCTCAGGCTCCAGCGGCTAA
The 2nd time AGCTCAGGATCTTCTGGTAGCTCAGGTAGCTCAGGCTGA
The 3rd time TCATCTGGTTCAAGTGGTAGCTCTGGCAGCTCTGGTTGA
The 4th time TCTAGCGGTTCATCTGGAAGTTCAGGCAGCTCAGGTTGA
The 5th time TCCAGTGGTTCTTCAGGTTCATCCGGTTCCTCAGGTTAA
The 6th time TCTTCTGGATCCTCCGGCTCTTCAGGCTCATCCGGTTAA
The 7th time TCCTCTGGTAGCTCCGGTAGCAGCGGAAGCAGCGGCTAA
The 8th time TCCTCAGGAAGCAGCGGCTCTTCTGGCTCAAGTGGATAA
The 9th time AGTAGTGGTAGCTCTGGATCTAGTGGAAGCTCTGGTTAA
The 10th time AGCAGTGGAAGTAGTGGATCAAGCGGCAGTAGCGGTTAA
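
A short Python sketch can show what avoiding repetition means for a sequence like those in Table 1. It translates a DNA sequence with the standard genetic code and counts how often the most frequent 9-base window occurs. The repeat measure, the window length of 9, and the "naive" comparison design (one codon per amino acid) are assumptions made for illustration; they are not taken from P2N.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def max_kmer_repeat(dna: str, k: int = 9) -> int:
    """Return how many times the most frequent k-base window occurs."""
    windows = Counter(dna[i:i + k] for i in range(len(dna) - k + 1))
    return max(windows.values())


# First row of Table 1.
P2N_ROW_1 = "TCATCCGGAAGCAGTGGCAGCTCAGGCTCCAGCGGCTAA"
# Hypothetical naive design: always the same codon for each amino acid.
NAIVE = "TCTTCTGGT" * 4 + "TAA"

for name, seq in [("Table 1, row 1", P2N_ROW_1), ("naive design", NAIVE)]:
    print(name, translate(seq), "max 9-mer count:", max_kmer_repeat(seq))
```

Both sequences encode the same linker, but the naive design repeats the same 9-base window several times, while the first row of Table 1 spreads the serine and glycine residues over different codons.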

Secondly, it can avoid duplicated restriction sites, which makes our digestion and ligation steps more precise. As shown in Fig.4, GGATCC is the BamHI recognition site. Given the repetitive input "(GGATCC)5", P2N produces a sequence that no longer contains the BamHI recognition site.

Fig.4 P2N can avoid restriction sites. The recognition site of BamHI is "GGATCC", and P2N removes it from the optimized sequence.

Table 2 Ten rounds of codon optimization for the repetitive BamHI-site sequence "(GGATCC)5". The results avoid both the restriction site and repetitive sequences.

The 1st time GGTTCTGGCTCTGGCTCTGGTTCCGGCAGCTAA
The 2nd time GGCAGTGGCAGCGGCAGCGGCAGTGGTAGCTGA
The 3rd time GGCAGTGGCAGCGGCAGCGGCAGTGGTAGCTGA
The 4th time GGCTCGGGTTCTGGTTCGGGCTCCGGCTCTTGA
The 5th time GGCAGTGGTTCTGGCTCGGGTTCGGGTTCCTAA
The 6th time GGTAGCGGCAGTGGCAGTGGTAGCGGCAGCTGA
The 7th time GGCTCTGGCTCCGGTTCGGGCAGTGGCTCCTAA
The 8th time GGTTCTGGTTCCGGTAGCGGCAGTGGCTCTTGA
The 9th time GGCTCGGGCTCGGGCAGCGGCTCGGGCAGCTAG
The 10th time GGTAGTGGCAGTGGCTCTGGCTCCGGTAGCTAA
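
One way to remove a restriction site without changing the protein is to swap a codon that overlaps the site for a synonymous one. The Python sketch below does this greedily. It is an assumption about how such a step could work, not P2N's actual algorithm, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def synonyms(codon: str) -> list[str]:
    """Return the other codons that encode the same amino acid."""
    aa = CODON_TABLE[codon]
    return [c for c, a in CODON_TABLE.items() if a == aa and c != codon]


def remove_site(dna: str, site: str = "GGATCC") -> str:
    """Greedily replace codons with synonyms until the site is gone."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    while (pos := "".join(codons).find(site)) != -1:
        before = "".join(codons).count(site)
        first, last = pos // 3, (pos + len(site) - 1) // 3
        for idx in range(first, last + 1):       # codons overlapping the site
            for alt in synonyms(codons[idx]):
                trial = codons[:idx] + [alt] + codons[idx + 1:]
                if "".join(trial).count(site) < before:
                    codons = trial               # accept the first improving swap
                    break
            else:
                continue                         # no swap helped at this codon
            break                                # a swap was accepted
        else:
            raise ValueError("no synonymous substitution removes the site")
    return "".join(codons)


original = "GGATCC" * 5
cleaned = remove_site(original)
print(cleaned, translate(cleaned))
```

Because this sketch always accepts the first improving swap, its output is deterministic and still repetitive. The varied rows in Table 2 suggest that P2N combines site removal with its repeat-avoidance step, which this example does not attempt.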

Thirdly, it has high accuracy. Existing tools are accurate only to the species level, whereas P2N, relying on the database, can be accurate to the subspecies level. In the future, we also plan to extend this to the mitochondrial level.

Fig.5 P2N has high accuracy: it is accurate to the subspecies level.

Table 3 Ten rounds of codon optimization for the repetitive sequence "SSGSSGSSGSSG" in E. coli Nissle 1917 and Lactobacillus lindneri.

Escherichia coli Nissle 1917 Lactobacillus lindneri
The 1st time TCATCCGGAAGCAGTGGCAGCTCAGGCTCCAGCGGCTAA TCTAGCGGATCCTCCGGATCAAGTGGCTCCTCAGGCTAA
The 2nd time AGCTCAGGATCTTCTGGTAGCTCAGGTAGCTCAGGCTGA TCATCAGGAAGCTCAGGCTCCAGCGGAAGTAGTGGCTAA
The 3rd time TCATCTGGTTCAAGTGGTAGCTCTGGCAGCTCTGGTTGA TCCTCAGGATCCTCAGGAAGTTCTGGCAGCAGTGGCTGA
The 4th time TCTAGCGGTTCATCTGGAAGTTCAGGCAGCTCAGGTTGA AGCAGCGGATCTAGTGGCTCAAGCGGATCCTCCGGATAA
The 5th time TCCAGTGGTTCTTCAGGTTCATCCGGTTCCTCAGGTTAA AGTAGTGGCAGTAGTGGATCCAGTGGCAGCAGCGGATAA
The 6th time TCTTCTGGATCCTCCGGCTCTTCAGGCTCATCCGGTTAA AGTAGTGGCTCATCAGGCAGCAGCGGCTCCTCTGGATGA
The 7th time TCCTCTGGTAGCTCCGGTAGCAGCGGAAGCAGCGGCTAA TCTTCTGGCAGTTCTGGCAGCAGTGGAAGCTCAGGATGA
The 8th time TCCTCAGGAAGCAGCGGCTCTTCTGGCTCAAGTGGATAA TCTTCTGGATCCAGCGGCTCTAGTGGAAGTAGTGGCTAA
The 9th time AGTAGTGGTAGCTCTGGATCTAGTGGAAGCTCTGGTTAA TCCTCTGGCTCATCAGGAAGTAGTGGCAGTAGTGGCTAA
The 10th time AGCAGTGGAAGTAGTGGATCAAGCGGCAGTAGCGGTTAA AGCAGCGGCTCATCCGGAAGCAGTGGCAGTAGTGGCTGA

Fourthly, the program has low time complexity, and each codon optimization takes less than 1 second.
Last but not least, P2N also has good optimization confidence. We used P2N to optimize multiple proteins from our therapeutic protein and secretion peptide part (see Mucosal Healing for details) for EcN, and compared the results with an existing tool, the GenScript® online codon preference tool (GenSmart). The sequences predicted by our software are slightly closer to the theoretical values for EcN.

Fig.6 Codon preference confidence analysis. In theory, the total GC% of EcN is 49.13%, the 1st letter GC% is 55.38%, the 2nd letter GC% is 42.34%, and the 3rd letter GC% is 50.58%. We compare the deviation from these theoretical values for the results of P2N and of the GenScript® online codon preference tool (GenSmart). The lighter the square, the better the codon optimization. (The DNA sequence of each protein is detailed on the part page.)
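
The GC percentages used in this comparison can be computed per codon position with a few lines of Python. In the sketch below, the reference values are the EcN figures quoted in the caption of Fig.6; measuring the bias as an absolute difference in percentage points is an assumption, since the exact metric is not stated here, and the function names are hypothetical.

```python
def gc_profile(dna: str) -> dict[str, float]:
    """Return total GC% and GC% at codon positions 1, 2 and 3."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3   # ignore a trailing partial codon

    def gc_percent(bases: str) -> float:
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    return {
        "total": gc_percent(dna[:usable]),
        "pos1": gc_percent(dna[0:usable:3]),
        "pos2": gc_percent(dna[1:usable:3]),
        "pos3": gc_percent(dna[2:usable:3]),
    }


# Theoretical EcN values quoted in the caption of Fig.6.
ECN_REFERENCE = {"total": 49.13, "pos1": 55.38, "pos2": 42.34, "pos3": 50.58}


def gc_bias(dna: str, reference: dict[str, float] = ECN_REFERENCE) -> dict[str, float]:
    """Absolute deviation from the reference, in percentage points (assumed metric)."""
    profile = gc_profile(dna)
    return {key: abs(profile[key] - reference[key]) for key in reference}


# Toy input: the first row of Table 1, used only to exercise the functions.
example = "TCATCCGGAAGCAGTGGCAGCTCAGGCTCCAGCGGCTAA"
print(gc_profile(example))
print(gc_bias(example))
```

A short linker like this one is too small to say anything about a genome-wide preference; the comparison in Fig.6 is made on full protein-coding sequences.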

How To Use P2N?

Firstly, enter the species name in the software's UI. Secondly, input the gene sequence. Thirdly, choose whether to avoid restriction enzyme sites and whether to add a stop codon.

Fig.7 The user interface of P2N. The arrows show how to use it.

P2N then reads the species and the sequence entered by the user. Web crawlers retrieve the codon preference table for that species from the database. A Codon Bias Table is generated from the occurrence probability of the different codons of each amino acid in that table. Next, the most likely RNA sequence is generated according to the Codon Bias Table. In this step, the program checks for repeated sequences and duplicated restriction sites; if any are found, the affected part of the sequence is replaced with a reasonable alternative. Finally, the corresponding DNA sequence is generated by the principle of complementary base pairing and returned to the user as the output.

Fig.8 The basic algorithm for P2N
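
The core of this workflow (codon counts, a per-amino-acid bias table, codon selection, and conversion to DNA) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not P2N's source code. The codon counts are made-up placeholders rather than values from the Codon Usage Database, the function names are hypothetical, codons are drawn by weighted random sampling (an assumption suggested by the different results per run in the tables above), and the DNA output is taken to be the coding strand, so U is simply replaced by T.

```python
import random

# Hypothetical codon counts for serine, glycine and stop; not real database values.
CODON_COUNTS = {
    "S": {"UCU": 10, "UCC": 10, "UCA": 8, "UCG": 9, "AGU": 9, "AGC": 16},
    "G": {"GGU": 25, "GGC": 29, "GGA": 8, "GGG": 11},
    "*": {"UAA": 20, "UGA": 9, "UAG": 2},
}


def bias_table(counts):
    """Normalize counts so the codon frequencies of each amino acid sum to 1."""
    table = {}
    for aa, codons in counts.items():
        total = sum(codons.values())
        table[aa] = {codon: n / total for codon, n in codons.items()}
    return table


def sample_rna(protein, table, rng):
    """Pick one codon per residue, weighted by its relative frequency."""
    picked = []
    for aa in protein:
        codons = list(table[aa])
        weights = [table[aa][codon] for codon in codons]
        picked.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(picked)


def rna_to_dna(rna):
    """Return the coding-strand DNA for an RNA sequence (assumed output form)."""
    return rna.replace("U", "T")


rng = random.Random(0)                      # fixed seed for a repeatable run
table = bias_table(CODON_COUNTS)
dna = rna_to_dna(sample_rna("SSGSSGSSGSSG*", table, rng))
print(dna)
```

A fuller version would add the repeat and restriction-site checks after sampling and resample or substitute the affected codons, as described in the paragraph above.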

References

[1] Mauro, V. P. (2018). Codon Optimization in the Production of Recombinant Biotherapeutics: Potential Risks and Considerations. BioDrugs: Clinical Immunotherapeutics, Biopharmaceuticals and Gene Therapy, 32(1), 69–81. https://doi.org/10.1007/s40259-018-0261-x
[2] Codon Usage Database. http://www.kazusa.or.jp/codon/

Copyright © 2021 iGEM Team: Tsinghua. All rights reserved.