Software-P2N


What is P2N?

Codon preference (codon usage bias) means that, for amino acids encoded by several synonymous codons, one or a few codons are used disproportionately often in a particular species. Different species have different codon preferences, and without proper optimization the translation efficiency of a gene expressed in another host can be greatly reduced^[1]. Many of the proteins in our projects do not come from E. coli but from mammals, so we urgently needed a codon optimization tool.
We therefore designed P2N to assist our projects and to serve other potential users. P2N is written mainly in Python and relies on web crawlers. It converts an amino acid sequence or DNA sequence entered by the user into a DNA sequence optimized for the codon preference of the target species. The software obtains codon preference information for a wide variety of species by querying a public database, the Codon Usage Database^[2], and then applies our own algorithm to return the highest-confidence result to the user.
You can download the source code and the software here.

Why Develop P2N?

In our Metabolic balancing project, we use engineered lactic acid bacteria (LAB) to express BSH and thereby regulate metabolic homeostasis. However, the therapeutic LAB need to stay in the intestinal tract for a period of time, so we plan to have the LAB express LAP to achieve adhesion to the intestinal tract (see Bacterial adhesion modeling for details).

Fig.1 LAB expressing LAP (Listeria adhesion protein) adhere better to small intestine cells, which helps the engineered LAB remain in the intestine.

However, we ran into great difficulties while cloning LAP. Over 3 months, neither we nor the gene synthesis company managed to clone the LAP gene into pMG36e, a widely used L. lactis vector. This held up our project considerably, so we systematically analyzed the failures. Looking closely at the company's codon-optimized LAP DNA sequence, we realized that its codon optimization was not reasonable.
The LAP DNA sequence supplied by the company contains a large number of repetitive sequences (the sequence AAAAA appears 21 times in the LAP gene) and more than a dozen repeated restriction enzyme sites. Such features are known to cause problems in PCR, such as slipped-strand mispairing and incorrect joining of fragments. More seriously, a poly-A stretch appears at one end of the LAP gene, which makes molecular cloning extremely difficult for us (see Fig.2).
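The kind of check described above (counting a short repeat such as AAAAA and spotting a long poly-A stretch) can be sketched in a few lines of Python. This is only an illustration, not part of P2N: the function names are hypothetical, and we assume overlapping occurrences are counted, which may differ from how the figure of 21 was obtained.

```python
import re
from typing import Tuple


def count_overlapping(seq: str, motif: str) -> int:
    """Count occurrences of motif in seq, allowing overlaps (an assumption)."""
    # A zero-width lookahead lets matches overlap, so "AAAAAA" contains "AAAAA" twice.
    return len(re.findall(f"(?={re.escape(motif)})", seq.upper()))


def longest_homopolymer(seq: str) -> Tuple[str, int]:
    """Return (base, length) of the longest run of a single repeated base."""
    best_base, best_len = "", 0
    # "(.)\1*" matches one character followed by any number of copies of itself.
    for match in re.finditer(r"(.)\1*", seq.upper()):
        run = match.group(0)
        if len(run) > best_len:
            best_base, best_len = run[0], len(run)
    return best_base, best_len
```

Running both functions on a candidate sequence before ordering a synthetic gene would flag motifs like the ones that caused trouble with LAP.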

Fig.2 A poly-A stretch appears at one end of the LAP gene, which makes molecular cloning extremely difficult.

Although these issues were extremely frustrating, they inspired us: why not develop a better software tool? To find a more scientific and reasonable way to design sequences, we developed our own codon preference algorithm.
For our project, this tool assists in the codon optimization of the therapeutic protein part. For other users, it helps to quickly find the best DNA sequence for a particular species. The software is therefore both general-purpose and convenient.

Why Choose P2N?

Firstly, it effectively avoids repetitive sequences. For example, when designing the DNA for a repetitive linker such as "SSGSSGSSGSSG", our software avoids repetitive DNA sequences.

Fig.3 P2N can avoid repetitive sequences.

Table 1 Ten codon optimization runs for the repetitive sequence "SSGSSGSSGSSG".

Run    Optimized DNA sequence
1st    TCATCCGGAAGCAGTGGCAGCTCAGGCTCCAGCGGCTAA
2nd    AGCTCAGGATCTTCTGGTAGCTCAGGTAGCTCAGGCTGA
3rd    TCATCTGGTTCAAGTGGTAGCTCTGGCAGCTCTGGTTGA
4th    TCTAGCGGTTCATCTGGAAGTTCAGGCAGCTCAGGTTGA
5th    TCCAGTGGTTCTTCAGGTTCATCCGGTTCCTCAGGTTAA
6th    TCTTCTGGATCCTCCGGCTCTTCAGGCTCATCCGGTTAA
7th    TCCTCTGGTAGCTCCGGTAGCAGCGGAAGCAGCGGCTAA
8th    TCCTCAGGAAGCAGCGGCTCTTCTGGCTCAAGTGGATAA
9th    AGTAGTGGTAGCTCTGGATCTAGTGGAAGCTCTGGTTAA
10th   AGCAGTGGAAGTAGTGGATCAAGCGGCAGTAGCGGTTAA

Secondly, it avoids restriction sites inside the designed sequence, which makes our digestion and ligation steps more precise. As shown in Fig.4, GGATCC is the BamHI recognition site, and P2N avoids this site even when given the repetitive sequence "(GGATCC)₅".

Fig.4 P2N can avoid restriction sites. The recognition site of BamHI is "GGATCC", and P2N keeps it out of the optimized sequence.

Table 2 Ten codon optimization runs for the repetitive BamHI-site sequence "(GGATCC)₅". The outputs also avoid repetitive sequences.

Run    Optimized DNA sequence
1st    GGTTCTGGCTCTGGCTCTGGTTCCGGCAGCTAA
2nd    GGCAGTGGCAGCGGCAGCGGCAGTGGTAGCTGA
3rd    GGCAGTGGCAGCGGCAGCGGCAGTGGTAGCTGA
4th    GGCTCGGGTTCTGGTTCGGGCTCCGGCTCTTGA
5th    GGCAGTGGTTCTGGCTCGGGTTCGGGTTCCTAA
6th    GGTAGCGGCAGTGGCAGTGGTAGCGGCAGCTGA
7th    GGCTCTGGCTCCGGTTCGGGCAGTGGCTCCTAA
8th    GGTTCTGGTTCCGGTAGCGGCAGTGGCTCTTGA
9th    GGCTCGGGCTCGGGCAGCGGCTCGGGCAGCTAG
10th   GGTAGTGGCAGTGGCTCTGGCTCCGGTAGCTAA
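A restriction-site check of this kind can be illustrated with a small Python sketch. It is not P2N's own code: `reverse_complement` and `find_site` are hypothetical helper names, and scanning for the reverse complement as well is our assumption so that non-palindromic sites are also caught (BamHI's GGATCC is palindromic, so one pattern suffices there).

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_site(seq: str, site: str) -> list:
    """0-based start positions where site, or its reverse complement, occurs in seq."""
    seq, site = seq.upper(), site.upper()
    hits = set()
    # A palindromic site equals its own reverse complement, so the set holds one pattern.
    for target in {site, reverse_complement(site)}:
        start = seq.find(target)
        while start != -1:
            hits.add(start)
            start = seq.find(target, start + 1)  # step by 1 to allow overlapping hits
    return sorted(hits)


# Sequences copied from the tables on this page.
table1_run6 = "TCTTCTGGATCCTCCGGCTCTTCAGGCTCATCCGGTTAA"
table2_run1 = "GGTTCTGGCTCTGGCTCTGGTTCCGGCAGCTAA"
print(find_site(table1_run6, "GGATCC"))  # a run from Table 1
print(find_site(table2_run1, "GGATCC"))  # a run from Table 2
```

A design tool could reject or repair any candidate for which `find_site` returns a non-empty list.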

Thirdly, it has high accuracy. Existing tools are only accurate to the species level, whereas P2N, relying on the database, can be accurate to the subspecies level. In the future, we also plan to extend this to the mitochondrial level.

Fig.5 P2N has high accuracy and can be accurate to the subspecies level.

Table 3 Ten codon optimization runs for the repetitive sequence "SSGSSGSSGSSG" in Escherichia coli Nissle 1917 and Lactobacillus lindneri.

Run    Escherichia coli Nissle 1917               Lactobacillus lindneri
1st    TCATCCGGAAGCAGTGGCAGCTCAGGCTCCAGCGGCTAA    TCTAGCGGATCCTCCGGATCAAGTGGCTCCTCAGGCTAA
2nd    AGCTCAGGATCTTCTGGTAGCTCAGGTAGCTCAGGCTGA    TCATCAGGAAGCTCAGGCTCCAGCGGAAGTAGTGGCTAA
3rd    TCATCTGGTTCAAGTGGTAGCTCTGGCAGCTCTGGTTGA    TCCTCAGGATCCTCAGGAAGTTCTGGCAGCAGTGGCTGA
4th    TCTAGCGGTTCATCTGGAAGTTCAGGCAGCTCAGGTTGA    AGCAGCGGATCTAGTGGCTCAAGCGGATCCTCCGGATAA
5th    TCCAGTGGTTCTTCAGGTTCATCCGGTTCCTCAGGTTAA    AGTAGTGGCAGTAGTGGATCCAGTGGCAGCAGCGGATAA
6th    TCTTCTGGATCCTCCGGCTCTTCAGGCTCATCCGGTTAA    AGTAGTGGCTCATCAGGCAGCAGCGGCTCCTCTGGATGA
7th    TCCTCTGGTAGCTCCGGTAGCAGCGGAAGCAGCGGCTAA    TCTTCTGGCAGTTCTGGCAGCAGTGGAAGCTCAGGATGA
8th    TCCTCAGGAAGCAGCGGCTCTTCTGGCTCAAGTGGATAA    TCTTCTGGATCCAGCGGCTCTAGTGGAAGTAGTGGCTAA
9th    AGTAGTGGTAGCTCTGGATCTAGTGGAAGCTCTGGTTAA    TCCTCTGGCTCATCAGGAAGTAGTGGCAGTAGTGGCTAA
10th   AGCAGTGGAAGTAGTGGATCAAGCGGCAGTAGCGGTTAA    AGCAGCGGCTCATCCGGAAGCAGTGGCAGTAGTGGCTGA

Fourthly, the program is fast: each codon optimization takes less than 1 second.
Last but not least, P2N gives optimization results with good confidence. We used P2N to optimize multiple proteins from our therapeutic protein and secretion peptide parts (see Mucosal Healing for details) for EcN, and compared the results with an existing tool, the GenScript® online codon preference tool (GenSmart). The sequences predicted by our software are slightly closer to the theoretical values for EcN.

Fig.6 Codon preference confidence analysis. In theory, the total GC% of EcN is 49.13%, the 1st-letter GC% is 55.38%, the 2nd-letter GC% is 42.34%, and the 3rd-letter GC% is 50.58%. We compare how far the results of P2N and of the GenScript® online codon preference tool (GenSmart) deviate from these theoretical values. The lighter the square, the better the codon optimization. (The DNA sequence of each protein is given on the part page.)
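The quantities compared in Fig.6 can be computed with a short Python sketch. This is an illustration under stated assumptions, not the script used for the figure: the function names are hypothetical, the absolute difference is our choice of deviation measure, and whether the stop codon was included in the original analysis is not known. The reference values are the EcN figures quoted in the caption.

```python
# EcN reference GC percentages as quoted in the Fig.6 caption.
ECN_REFERENCE = {"total": 49.13, "gc1": 55.38, "gc2": 42.34, "gc3": 50.58}


def gc_by_codon_position(cds: str) -> dict:
    """GC% overall and at the 1st, 2nd and 3rd codon positions of a coding sequence."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be non-empty and a multiple of 3 long")

    def pct(bases: str) -> float:
        return 100.0 * sum(base in "GC" for base in bases) / len(bases)

    return {
        "total": pct(cds),
        "gc1": pct(cds[0::3]),  # every first codon letter
        "gc2": pct(cds[1::3]),  # every second codon letter
        "gc3": pct(cds[2::3]),  # every third codon letter
    }


def deviation(observed: dict, reference: dict) -> dict:
    """Absolute difference from the reference for each GC measure (assumed metric)."""
    return {key: abs(observed[key] - reference[key]) for key in reference}
```

For example, `deviation(gc_by_codon_position(seq), ECN_REFERENCE)` gives one value per measure for a candidate sequence `seq`; smaller values correspond to the lighter squares in the figure.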

How To Use P2N?

Firstly, enter the species name in the software's UI. Secondly, input the gene sequence. Thirdly, choose whether to avoid restriction enzyme sites and whether to add a stop codon.

Fig.7 The user interface of P2N. The arrows show how to use P2N.

P2N then reads the species information and the sequence entered by the user. Web crawlers retrieve the codon preference table for that species from the database, and a Codon Bias Table is generated from the probability with which each codon of the same amino acid occurs in that table. Next, the most likely RNA sequence is generated according to the Codon Bias Table. At this step, the program checks whether the sequence contains repetitive sequences or restriction sites; if so, the sequence is replaced with a suitable alternative. Finally, the corresponding DNA sequence is generated by the base complementary pairing principle and returned to the user.
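The workflow above can be sketched roughly as follows. This is a minimal illustration, not P2N's actual implementation. Several things are assumptions on our part: the codon table `TOY_USAGE` is hard-coded with placeholder frequencies instead of being crawled from the Codon Usage Database, codons are drawn by weighted random sampling (suggested by the run-to-run variation in Tables 1-3), rejected candidates are simply resampled, and the sketch works directly in DNA letters. All names and parameters are hypothetical.

```python
import random

# Placeholder relative frequencies, illustrative only (NOT real database values).
TOY_USAGE = {
    "S": {"TCT": 0.15, "TCC": 0.15, "TCA": 0.15, "TCG": 0.10, "AGT": 0.20, "AGC": 0.25},
    "G": {"GGT": 0.35, "GGC": 0.35, "GGA": 0.15, "GGG": 0.15},
    "*": {"TAA": 0.60, "TGA": 0.30, "TAG": 0.10},
}


def has_repeat(dna: str, k: int) -> bool:
    """True if any k-letter window occurs more than once in dna."""
    seen = set()
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def back_translate(protein, usage, forbidden=(), repeat_k=12, max_tries=1000, seed=None):
    """Sample codons by usage weight; resample until no forbidden site or k-mer repeat remains."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        codons = []
        for amino_acid in protein:
            options = usage[amino_acid]
            # Draw one synonymous codon with probability proportional to its frequency.
            codons.append(rng.choices(list(options), weights=list(options.values()))[0])
        dna = "".join(codons)
        if any(site in dna for site in forbidden):
            continue  # candidate contains a restriction site, try again
        if has_repeat(dna, repeat_k):
            continue  # candidate contains a repeated stretch, try again
        return dna
    raise RuntimeError("no acceptable sequence found within max_tries")


print(back_translate("SSGSSGSSGSSG*", TOY_USAGE, forbidden=("GGATCC",), seed=0))
```

Resampling the whole sequence is the simplest possible repair strategy; replacing only the offending codons, as the description above suggests, would be more efficient for long proteins.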

Fig.8 The basic algorithm for P2N

Reference

[1] Mauro, V. P. (2018). Codon Optimization in the Production of Recombinant Biotherapeutics: Potential Risks and Considerations. BioDrugs: clinical immunotherapeutics, biopharmaceuticals and gene therapy, 32(1), 69–81. https://doi.org/10.1007/s40259-018-0261-x
[2] Codon Usage Database. http://www.kazusa.or.jp/codon/


Team:Tsinghua/Software
