Bioinformatics

Introduction

For biocontainment to succeed, conjugation must be inhibited to guard against horizontal gene transfer between microbes. A microbe carrying recombinant DNA that can still conjugate could transfer synthetic plasmids and other nucleic acids directly to microbes in the wild, spreading them throughout the population. Disabling conjugation is therefore imperative for the biocontainment of microbes intended for unsupervised release, such as into wild environments or gut microbiomes.

We devised a system for disabling conjugative plasmids that hinges on CRISPR-Cas9. A small array would encode sgRNAs that direct Cas9 to cleave conjugative plasmids at locations essential to conjugation. Once such an essential region is cut, expression of the gene it encodes would be disrupted and the conjugative plasmid would be rendered useless. To keep the number of sgRNAs small, we needed to find regions that are highly conserved across many conjugative plasmids, so that function is knocked out in the greatest number of plasmids with the least chance of off-target cleavage. We formed the bioinformatics team to take on this computationally demanding task.
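
As a rough illustration of what a candidate target looks like computationally, the sketch below scans a DNA string for 20-nucleotide protospacers followed by an NGG PAM. It assumes the commonly used Streptococcus pyogenes Cas9, which this page does not specify. The function name find_cas9_sites, the forward-strand-only scan, and the flanking bases in the example are assumptions for illustration, not the pipeline we ran.

```python
import re

def find_cas9_sites(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for every forward-strand NGG site.

    Assumes SpCas9-style targeting: a protospacer immediately followed
    by an NGG PAM. A lookahead is used so overlapping sites are reported.
    """
    seq = seq.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]

# Hypothetical input: the consensus target from the Results section,
# padded with made-up flanking bases.
example = "CC" + "AAGGATAATTACTATGTGCTGGG" + "A"
for start, protospacer, pam in find_cas9_sites(example):
    print(start, protospacer, pam)
```

A fuller design tool would also scan the reverse complement and score off-target matches; both are left out here to keep the sketch short.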

Literature Review

To begin identifying regions of conservation on conjugative plasmids, we first established a classification system to organize plasmids. Research into plasmid classification revealed two ways plasmids are characterized: the MOB typing system and the Incompatibility type system.

The MOB system partitions conjugative plasmids by their relaxase, an enzyme that is crucial for initiating conjugation [1,2]. However, MOB classes are defined by the amino acid sequence of the relaxase. Because of codon degeneracy, plasmids in the same MOB class can still differ in their nucleotide sequences, which makes the MOB system unsuitable for our purposes.
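
To make the codon degeneracy concern concrete, the following sketch counts how many distinct nucleotide sequences can encode a short peptide under the standard genetic code. The peptide is an arbitrary example and the helper names are ours, chosen only for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of synonymous codons available for each amino acid.
DEGENERACY = Counter(CODON_TABLE.values())

def encodings(peptide):
    """Number of distinct DNA sequences that translate to the peptide."""
    total = 1
    for aa in peptide:
        total *= DEGENERACY[aa]
    return total

print(DEGENERACY["Y"])        # synonymous codons for tyrosine
print(encodings("KDNYYVL"))   # arbitrary 7-residue example peptide
```

Even a seven-residue stretch can be written in hundreds of ways at the nucleotide level, which is why a guide RNA designed against one plasmid's sequence may not match another plasmid encoding the same protein.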

The Incompatibility system partitions plasmids into groups according to whether they can coexist with one another in the same host cell [2,3]. Two plasmids whose replication or partitioning systems are too similar cannot be stably maintained together and are assigned to the same incompatibility group. Because far more information is available on incompatibility groups than on MOB groups, we could categorize the plasmid sequences we collected more reliably. For this reason we focused on Incompatibility typing. Among the incompatibility groups, we chose to focus on Incompatibility P (IncP) plasmids because the plasmid RP4 has been so heavily characterized [4].

After narrowing our sights to the Incompatibility P (IncP) plasmid family, we sought to identify heavily conserved regions on these plasmids. Initial research led us to the origin of transfer (OriT) [4], the nucleotide region on a conjugative plasmid where the relaxase enzyme binds to initiate the transfer of DNA from the host to a recipient. OriT regions, however, were seldom annotated on conjugative plasmid sequences. Genes coding for relaxases, the enzymes that bind this region, are annotated far more frequently.

This led us to search for the gene coding for the conjugative relaxase of IncP plasmids, TraI. We then identified the catalytic center of the TraI relaxase, its Tyrosine 22 residue [5]. We expected the catalytic center to be the most highly conserved residue of the relaxase, and therefore expected nucleotide variation from codon degeneracy to be minimal in this part of the CDS. Consequently, the sites we direct Cas9 to cut should be highly conserved across the plasmid population.

Methods

To identify nucleotide conservation across populations of plasmids, we used several methods to measure the degree of nucleotide similarity between plasmids. Clustal Omega was used to generate multiple sequence alignments of many sizes, from up to 30 whole plasmid sequences to hundreds of gene sequences. This technique was ultimately our most successful: it produced many alignments that were used for exploratory data analysis as well as for our final list of sequences for Cas9 to target. All-against-all BLAST searches were also performed on the datasets, and the results were fed into the program Cytoscape to generate sequence similarity networks [6]. These networks are designed to show the relationships between plasmid sequences, but the networks built from our all-against-all BLAST searches showed no discernible correspondence between the sequences and their placement in the network. As a consequence, this novel approach was abandoned. Phylogenetic trees were also constructed from sequence alignment data using the program RAxML [7]. The trees were initially built from small sequence alignments, but they were likely subject to errors and were discarded. Had there been more time, RAxML would have been used to create phylogenies of gene sequences, to analyze relationships between genes instead of whole plasmids.

Conjugative plasmid sequences were originally collected by hand from the NCBI nucleotide database [8] and from PLSDB, a plasmid database [9]. PLSDB allowed us to collect plasmid sequences categorized by incompatibility type, which was useful for both intra- and inter-family sequence alignments. The NCBI nucleotide database was used again later in the process to obtain sequences of particular genes, such as TraI, rather than entire plasmids, so that we could create more applicable sequence alignments.

We used Benchling to generate and visualize small sequence alignments of one to five plasmids. Clustal Omega [10] was used to generate multiple sequence alignments of larger plasmid datasets, as well as all of the gene-specific alignments. These larger alignments were then visualized using the program MEGA X [11].

Multiple Sequence Alignment

Multiple sequence alignment initially served as a way to identify regions of conservation that our initial literature review of conjugative plasmid sequences and their gene content may not have covered. We first aligned whole plasmid genomes against one another to determine whether, and where, there were continuous stretches of nucleotide homology. The conserved nucleotide sequences were then BLAST-searched on the NCBI website to determine which genes and regions were most commonly conserved in the alignment.
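
The idea of scanning an alignment for continuous stretches of homology can be sketched as follows. This is a minimal illustration that assumes the sequences are already aligned to equal length with '-' marking gaps. The function names, window size, threshold, and toy sequences are all made up for the example, and it is not the procedure used inside Benchling or Clustal Omega.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common non-gap base, per column."""
    scores = []
    for column in zip(*alignment):
        counts = Counter(base for base in column if base != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores

def conserved_windows(alignment, window=5, threshold=1.0):
    """Start indices of windows in which every column meets the threshold."""
    scores = column_conservation(alignment)
    return [i for i in range(len(scores) - window + 1)
            if all(s >= threshold for s in scores[i:i + window])]

# Toy alignment (made-up sequences, already aligned).
toy = ["ACGTTACTACGGA",
       "ACCTTACTACGGT",
       "A-GTTACTACGCA"]
print(column_conservation(toy))
print(conserved_windows(toy, window=6))
```

Lowering the threshold below 1.0 would tolerate occasional substitutions, at the cost of returning windows that no single guide sequence matches exactly.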

Initially, this method allowed us to observe conserved nucleotide sequences in small alignments of 2-5 sequences, which we ran on Benchling to visualize the degree of homology within and between plasmid families. These alignments gave mixed results, however: they were very imprecise, with plasmids of the same family frequently failing to align. This difficulty led us to question our continued use of Benchling even for preliminary sequence alignments. We presumed that the failure to align plasmids of the same family was due to their immense size combined with hardware limitations, given that Benchling is a web browser-based application, and that we would need more computational power to run the alignments we wanted. As a consequence, we began creating our alignments with Clustal Omega.

Clustal Omega was used to construct larger sequence alignments, which gave us a more detailed picture of the degree of nucleotide homology among conjugative plasmids. We used it to create our initial 24-sequence IncP plasmid alignment, as well as our 38-sequence multi-family alignment. Still, errors and software issues persisted. Most noticeably, memory allocation errors began appearing, suggesting that we had not given the software enough memory to run these alignments. However, discussion with other bioinformaticians revealed that much larger alignments could be run with the parameters we had set; the only difference was that their alignments used many small sequences rather than a few large ones. Persistently stonewalled by these errors, we began to question the validity of our method of data exploration: aligning large sequences and then combing through the resulting alignment for highly conserved regions.

This method proved to be flawed. After researching the problem, we found that multiple sequence alignment programs were not designed to align sequences longer than a few thousand base pairs. When large alignments are performed, many programs alter their parameters to speed up the alignment at the expense of accuracy or precision. The whole plasmid sequences we had been running up to that point regularly exceeded 100,000 base pairs. As a consequence, alignments of such sequences were likely limited to a small number of sequences. Though 20 to 30 sequences can provide basic insights, we wanted our alignments to draw on larger sample sizes to address the question rigorously. To get around this limitation, we decided to align only gene sequences, given their small size. Instead of using multiple sequence alignment of whole plasmids to discover conserved sites, we carried out a deeper exploration of the literature to identify notable regions of conservation. Through this process, we identified a region of interest: the TraI gene of Incompatibility P conjugative plasmids, which codes for an enzyme known as the relaxase. This enzyme is pivotal to conjugation, which makes it an excellent target for knockout.
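
A back-of-the-envelope calculation suggests why sequence length mattered more than sequence count. The sketch assumes a full dynamic-programming pairwise alignment, whose score matrix holds roughly one cell per pair of positions, and an assumed 4 bytes per cell. Real aligners such as Clustal Omega use heuristics that differ from this simple model, so the figures are only indicative. The 1,000 bp gene length is likewise an illustrative assumption.

```python
def dp_matrix_bytes(len_a, len_b, bytes_per_cell=4):
    """Memory for a full (len_a + 1) x (len_b + 1) alignment score matrix."""
    return (len_a + 1) * (len_b + 1) * bytes_per_cell

gene = dp_matrix_bytes(1_000, 1_000)          # two gene-sized sequences (assumed length)
plasmid = dp_matrix_bytes(100_000, 100_000)   # two whole plasmids of about 100,000 bp

print(f"gene pair:    {gene / 1e6:.1f} MB")
print(f"plasmid pair: {plasmid / 1e9:.1f} GB")
print(f"ratio:        {plasmid / gene:,.0f}x")
```

Under these assumptions a single pair of whole plasmids needs tens of gigabytes, while a pair of gene-length sequences needs a few megabytes. This is consistent with other groups running many small sequences without trouble under the same parameters.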

Results

The bioinformatics sub-team performed a gene homology assessment on 284 plasmids in the Incompatibility P family. We assessed the homology of the TraI gene, which codes for the conjugative relaxase, an enzyme essential for conjugation and DNA transfer. We identified the catalytic center of the relaxase, Tyrosine 22 [5], and created a list of 10 sequences to target for knockout with CRISPR-Cas9. Our consensus sequence is as follows:

AAGGATAATTACTATGTGCTGGG

Another 9 sequences were selected from the IncP plasmid multiple sequence alignment:

AAAGACAATTACTACGTCATCGG

AAGGATAATTACTACGTCATAGG

AGGGATAACTACTACGTGCTGGG

AAGGATAATTACTATGTACTGGG

TCCGATAACTACTATTTTCTGGG

AAGGATAATTACTATGTCTTGGG

AAAGATAATTACTACGTCATCGG

GAAGACAACTACTATGCCAGCGG

AAGGATAATTACTACGTTATCGG

These sequences, in combination, should make excellent targets for Cas9. The consensus sequence should direct cleavage of a plurality of IncP plasmids in this essential region, while the rest act as redundancies to ensure cleavage of IncP plasmids that carry nucleotide substitutions due to codon degeneracy. With this nucleotide region knocked out, the relaxase should cease to function: it will fail to bind the conjugative plasmid DNA and will not catalyze its unfurling from circular to linear form [5]. As a consequence, plasmids will not be able to replicate for the purposes of conjugation, halting their proliferation through conjugation. The excision should also induce a frameshift mutation, further impairing the relaxase’s ability to function. We hope that downstream effects would further impair the relaxase and, by extension, the relaxosome’s ability to catalyze the process of transfer. Together, we hope that disabling this region will inhibit conjugation enough to significantly strengthen the biocontainment of our host.
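
The ten targets listed above can be sanity-checked with a few lines of code. The check assumes each 23-nucleotide sequence is laid out as a 20-nucleotide protospacer followed by a 3-nucleotide NGG PAM. The variable and function names are ours, for illustration only.

```python
CONSENSUS = "AAGGATAATTACTATGTGCTGGG"
VARIANTS = [
    "AAAGACAATTACTACGTCATCGG",
    "AAGGATAATTACTACGTCATAGG",
    "AGGGATAACTACTACGTGCTGGG",
    "AAGGATAATTACTATGTACTGGG",
    "TCCGATAACTACTATTTTCTGGG",
    "AAGGATAATTACTATGTCTTGGG",
    "AAAGATAATTACTACGTCATCGG",
    "GAAGACAACTACTATGCCAGCGG",
    "AAGGATAATTACTACGTTATCGG",
]
TARGETS = [CONSENSUS] + VARIANTS

def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Assumed layout: 20-nt protospacer + NGG PAM, so 23 nt ending in GG.
for target in TARGETS:
    assert len(target) == 23 and target.endswith("GG"), target

# How far each redundant target sits from the consensus.
for variant in VARIANTS:
    print(variant, mismatches(CONSENSUS, variant))

# 0-based positions that are identical across all ten targets.
invariant = [i for i, col in enumerate(zip(*TARGETS)) if len(set(col)) == 1]
print(invariant)
```

The invariant positions indicate where the targets agree exactly. The remaining positions are the substitutions that the redundant guides are meant to cover.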

Future Avenues

Once our new methodology was established and shown to produce working alignments, only a few weeks were left in the program. We would be thrilled if our method of identifying regions amenable to knockout by CRISPR-Cas9 could be used to find regions of homology in other plasmid incompatibility families, such as the IncN, IncW, and IncX families, which have considerable structural similarities in their conjugative apparatus [12]. Our set of knockout targets was possible with IncP plasmids because the TraI relaxase has a single catalytic center: to disable it, only one amino acid, the catalytic center, needs to be knocked out. A close relative of the TraI relaxase, the TrwC relaxase, requires two separate amino acids to be knocked out before it ceases to function [13]. Regions that can only be disabled with multiple knockouts are therefore unsuited to this conjugation prevention system. With more time, we would have developed a more rigorous system for choosing the binding regions, perhaps using a script to comb through all possible targeting regions and determine the most frequent nucleotide permutations in the region targeted. Finally, regions other than the conjugative relaxase could be investigated for knockout. Other members of the TraI gene family could be targeted to incapacitate the conjugative apparatus in a similar way to targeting TraI.
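
A script of the kind described above might start from something like the following sketch. It assumes the target window has already been extracted from each plasmid's TraI sequence, tallies how often each nucleotide variant occurs, and picks the fewest exact-match variants needed to reach a chosen coverage. The function names and the tallies in the example are hypothetical; they are not our measured frequencies.

```python
from collections import Counter

def rank_variants(windows):
    """Return (sequence, count) pairs for a target window, most common first."""
    return Counter(windows).most_common()

def guides_for_coverage(windows, coverage=0.9):
    """Fewest exact-match variants whose combined share reaches the coverage."""
    if not windows:
        return []
    chosen, covered = [], 0
    for seq, count in rank_variants(windows):
        if covered / len(windows) >= coverage:
            break
        chosen.append(seq)
        covered += count
    return chosen

# Made-up tallies for illustration only.
windows = (["AAGGATAATTACTATGTGCTGGG"] * 6
           + ["AAGGATAATTACTACGTCATAGG"] * 3
           + ["AAGGATAATTACTATGTACTGGG"] * 1)
print(rank_variants(windows))
print(guides_for_coverage(windows, coverage=0.9))
```

Because each plasmid carries exactly one variant of the window, taking variants in order of frequency yields the smallest exact-match set. Accounting for guides that tolerate mismatches would require a different selection strategy.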

References

1. Garcillán-Barcia MP, Francia MV, de La Cruz F. 2009. The diversity of conjugative relaxases and its application in plasmid classification. FEMS Microbiology Reviews 33:657–687.
2. Shintani M, Sanchez ZK, Kimbara K. 2015. Genomics of microbial plasmids: Classification and identification based on replication and transfer systems and host taxonomy. Frontiers in Microbiology 6.
3. Novick RP. 1987. Plasmid incompatibility. Microbiological Reviews 51:381–395.
4. Pansegrau W, Balzer D, Kruft V, Lurz R, Lanka E. 1990. In vitro assembly of relaxosomes at the transfer origin of plasmid RP4. Proceedings of the National Academy of Sciences 87:6555–6559.
5. Pansegrau W, Schroder W, Lanka E. 1993. Relaxase (TraI) of IncP alpha plasmid RP4 catalyzes a site-specific cleaving-joining reaction of single-stranded DNA. Proceedings of the National Academy of Sciences 90:2925–2929.
6. Shannon P. 2003. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research 13:2498–2504.
7. Stamatakis A. 2014. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.
8. Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, Comeau DC, Funk K, Kim S, Klimke W, Marchler-Bauer A, Landrum M, Lathrop S, Lu Z, Madden TL, O’Leary N, Phan L, Rangwala SH, Schneider VA, Skripchenko Y, Wang J, Ye J, Trawick BW, Pruitt KD, Sherry ST. 2018. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 49.
9. Galata V, Fehlmann T, Backes C, Keller A. 2018. PLSDB: A resource of complete bacterial plasmids. Nucleic Acids Research 47.
10. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. 2011. Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7.
11. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. 2018. MEGA X: Molecular evolutionary genetics analysis across computing platforms. Molecular Biology and Evolution 35:1547–1549.
12. Schröder G, Lanka E. 2005. The mating pair formation system of conjugative plasmids—a versatile secretion machinery for transfer of proteins and DNA. Plasmid 54:1–25.
13. Grandoso G, Avila P, Cayón A, Hernando MA, Llosa M, de la Cruz F. 2000. Two active-site tyrosyl residues of protein TrwC act sequentially at the origin of transfer during plasmid R388 conjugation. Journal of Molecular Biology 295:1163–1172.

Team:MichiganState/Bioinformatics