Team:TAU Israel/Future Plans

Future Plans - Engineering

Our Vision

We believe that this software tool is the first step in the development of a new philosophy that promotes the progression of genetic engineering beyond the borders of the supervised lab in a safe and controlled manner. 

Because this initiative is new, the models and software we have proposed are only the tip of the iceberg. Many additional aspects can be added to improve this technology and broaden its capabilities.

The future directions are at different stages of research and development, but they all demonstrate the critical role of this technology in the future of synthetic biology. They include the following components:

  1. Modeling the Origin of Replication

    The origin of replication is the genetic element that promotes replication, thus has the largest effect on the long-term interactions between the plasmid and the microbiome. The origin of replication can be tailored to the microbial population by conserving the innate topology present in it. Read more here.

  2. Modeling the Shine-Dalgarno position

    The Shine-Dalgarno sequence positions the small ribosomal subunit on the mRNA and critically affects translation patterns. The distance between the Shine-Dalgarno sequence and the translation initiation codon, which is usually ATG, tends to vary slightly among organisms, a property that can be used to enhance the selectivity of translation. Read more here.

  3. CRISPR system utilization

    Clustered regularly interspaced short palindromic repeats (CRISPR) is another bacterial defense system which can be used to achieve a more binary degree of selectivity. Read more here.

  4. Plasmid family approach - beyond the single sequence solution

    As seen in the model analysis, when dealing with large groups of organisms in which the split between optimized and deoptimized organisms does not follow evolutionary taxonomy, optimization is difficult to obtain on average and nearly impossible to obtain at the margins (the least optimized organisms of the optimized group and the most deoptimized organisms of the deoptimized group). A solution proposed by Prof. Elhanan Borenstein is to offer not one optimal sequence, but several. Read more here.

ORI - Origin Of Replication

The bacterial origin of replication, also known as “OriC”, is a specific region in the bacterial chromosome that is responsible for regulating the process of DNA replication. OriCs are diverse in size (250 bp to 2 kbp) and contain three important functional elements that control origin activity:

  1. DnaA boxes are highly conserved, AT-rich 9-base sequences with a lower melting temperature, to which the DnaA protein binds to initiate DNA replication.
  2. The DNA Unwinding Element (DUE) is an AT-rich region (13-mers in size) with lower thermodynamic stability, in which strand separation is triggered following DnaA binding to the adjacent DnaA boxes. This provides the entry site for the helicase and, later, the entire replication machinery (such as primase, DNA Pol III, etc.).
  3. Several protein binding sites that help regulate the initiation of DNA replication.

The nucleotide sequence of OriC is highly diverse across unrelated species, so OriCs are generally not active in, or interchangeable between, unrelated bacteria. OriC is diverse in terms of chromosomal locus, genetic context, nucleotide sequence, length, and continuity. In some bacteria, the OriC is composed of two separate subregions (bipartite), each of which contains DnaA boxes, but only one of which contains a DUE. The subregions are separated by a spacer gene (usually dnaA).


Figure 1: ORI topology [1]

DnaA boxes

In most bacteria, DnaA boxes consist of a conserved 9-mer sequence. The number of boxes usually differs from species to species. Because DnaA boxes are similar across species, DnaA from certain bacteria was shown to recognize the DnaA boxes in the OriC of other bacteria. However, the protein was not able to bend the DNA in the manner required to unwind the DNA strands. This provides further evidence that even when there are apparent similarities, the DnaA-OriC systems of individual species are not easily interchangeable.


Figure 2: DnaA box in different organisms [1]

It is important to note that Figure 2 shows the perfect high-affinity DnaA box sequences. Each bacterium has its own specific combination of high- and low-affinity DnaA boxes. The low-affinity DnaA boxes differ from the high-affinity boxes in at least two nucleotides. The spacing between low- and high-affinity boxes is responsible for the proper orientation of the DnaA proteins in the orisome and allows cooperative binding of DnaA to adjacent DnaA boxes. Accordingly, the spacing between these elements differs among bacterial species.
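The distinction between high- and low-affinity boxes can be illustrated with a small scan. This is a minimal sketch, not part of our software: it assumes the E. coli box TTATCCACA as the reference sequence, scans both strands, and uses illustrative thresholds (at most one mismatch is labeled "high", exactly two mismatches "low"). The function names are hypothetical.

```python
CONSENSUS = "TTATCCACA"  # assumed reference: E. coli high-affinity DnaA box


def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_dnaa_boxes(seq, consensus=CONSENSUS, max_mismatches=2):
    """Return (position, strand, 9-mer, mismatches, label) for each candidate box."""
    seq = seq.upper()
    k = len(consensus)
    rc_consensus = reverse_complement(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for strand, ref in (("+", consensus), ("-", rc_consensus)):
            mm = hamming(window, ref)
            if mm <= max_mismatches:
                # Illustrative threshold: 0-1 mismatches "high", 2 mismatches "low".
                label = "high" if mm <= 1 else "low"
                hits.append((i, strand, window, mm, label))
    return hits


def box_spacings(hits):
    """Distances between the start positions of consecutive candidate boxes."""
    positions = sorted(h[0] for h in hits)
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence: one perfect box and one box with two substitutions.
toy = "GGC" + "TTATCCACA" + "GCGCGC" + "TTATGCACC" + "GGG"
hits = find_dnaa_boxes(toy)
for hit in hits:
    print(hit)
```

Comparing the output of `box_spacings` between species would be one simple way to quantify the spacing differences described above.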

DUE

The DUE region is typically an AT-rich stretch of nucleotides [2] that often includes characteristic repeated AT-rich sequences (e.g., that of E. coli comprises three 13-mer repeats) separated by short, non-AT-rich insertions. DUE regions are thermodynamically unstable compared to their neighboring sequences, rendering them susceptible to superhelical stress arising from the formation of the DnaA oligomer. The initially unwound region ranges from 20 to 60 bp in size, depending on the organism, which seems to provide sufficient space to accommodate the replicative helicase, DnaB. The bacterial DUE regions are always located upstream or downstream of one or more DnaA box clusters, never in the midst of a cluster. It is important to note that the distance between the DUE and its proximal DnaA-box cluster is critical, as even slight changes were found to inhibit oriC unwinding.
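Because the DUE is defined by its AT-richness, a sliding-window AT fraction is a simple first-pass way to flag candidate regions. The sketch below is only an illustration under that assumption: the 13-base window mirrors the E. coli repeat length, the function names are hypothetical, and a real analysis would use a thermodynamic stability model rather than raw base composition.

```python
def at_fraction(seq):
    """Fraction of A/T bases in a non-empty DNA string."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def most_at_rich_window(seq, window=13):
    """Return (start, AT fraction) of the first window with the highest AT fraction."""
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    best_start, best_frac = 0, -1.0
    for i in range(len(seq) - window + 1):
        frac = at_fraction(seq[i:i + window])
        if frac > best_frac:
            best_start, best_frac = i, frac
    return best_start, best_frac


# Toy sequence: a 13-base AT-only stretch flanked by GC-rich sequence.
toy = "GCGCGGCCGC" + "ATTTATTTAATAT" + "GCCGCGGCGC"
start, frac = most_at_rich_window(toy)
print(start, frac)
```

The start position returned here could then be compared with the positions of nearby DnaA boxes to measure the DUE-to-cluster distance that the text describes as critical.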

Engineering Plan:

We wanted to engineer the ORI of our plasmid in a way that enhances replication in the optimized organisms and diminishes replication in the deoptimized organisms.

First, to study the topology of ORI sequences in different bacteria, we looked for databases of ORIs from different organisms. We found that the plasmid ORI sequence is slightly different from the genomic ORI sequence, which means we have to introduce changes to the genomic ORI or else the plasmid will be lost. We planned to map the differences between the genomic ORI and the plasmid ORI in order to understand which regions differ between them and which changes we would have to make when we engineer our ORI.

We found “GammaBoris”, which has 25,000 different genomic ORIs but no plasmid sequences, so we kept searching and found Ori-Finder. The Ori-Finder system is designed for oriC prediction in bacterial and archaeal genomes; it integrates gene prediction, analysis of base composition asymmetry, the distribution of DnaA boxes, the occurrence of genes frequently found close to oriC regions, and phylogenetic relationships. The system also has a database, DoriC, which includes the results predicted by Ori-Finder as well as experimentally confirmed oriCs from the literature in PubMed. The database stores 9,928 records of prokaryotic chromosomal origins and, for the first time, includes 1,209 records of plasmid replication origins. Unfortunately, this database still does not have enough data on plasmid ORIs. We are currently searching for another algorithm to locate the plasmid ORIs of different organisms to better understand the differences between species.

Once we have enough data, we can engineer the plasmid ORI according to the ORI sequences from all optimized bacteria. We will develop a metric to compare and rank different features of the OriCs so that we can cluster them into groups of closely related sequences. We will derive a consensus sequence for each cluster and insert it into the plasmid in order to create a shuttle plasmid (a plasmid built in a way that enables propagation in two or more different host species [3]). This will lead to an ORI that optimizes replication in the wanted hosts.
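The consensus step could look like the following sketch. It assumes the sequences of a cluster have already been aligned to equal length by some external tool, takes a simple majority vote per column, and writes 'N' when a column is tied; these choices and the function name are illustrative assumptions, since the actual metric and clustering have not been developed yet.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Majority-vote consensus of equal-length, pre-aligned sequences."""
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    result = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        top_base, top_count = counts.most_common(1)[0]
        # A column with no single most frequent base is reported as 'N'.
        tied = [base for base, n in counts.items() if n == top_count]
        result.append(top_base if len(tied) == 1 else "N")
    return "".join(result)


# Toy cluster of four aligned 9-mers.
cluster = ["TTATCCACA", "TTATCCACA", "TTATGCACA", "TTTTCCACC"]
print(consensus(cluster))
```

For real OriC clusters, a position weight matrix would retain more information than a single consensus string, but the consensus is the form that can be inserted directly into a plasmid design.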

References

  1. M. Wolański, R. Donczew, A. Zawilak-Pawlik, and J. Zakrzewska-Czerwińska, “oriC-encoded instructions for the initiation of bacterial chromosome replication,” Frontiers in Microbiology, vol. 5, 2015.
  2. M. Rajewska, K. Wegrzyn, and I. Konieczny, “AT-rich region and repeated sequences – the essential elements of replication origins of bacterial replicons,” FEMS Microbiology Reviews, vol. 36, no. 2, pp. 408–434, 2012.
  3. Molecular Cell Biology, 4th edition. W. H. Freeman & Co, 2000.



Shine Dalgarno Distances

The Shine Dalgarno (SD) sequence is a ribosomal binding site in the mRNA, located upstream of the start codon (AUG), that facilitates the initiation of protein translation [1,2]. We wanted to check whether the distance between the SD sequence and the start codon varies between species. If there is indeed a significant difference, we can take the "preferred distance" of the wanted species into account when we build our plasmid, to enhance its expression in the wanted species over the other organisms.


Figure 3: SD topology [3]

Our first POC deals with Escherichia coli and Bacillus subtilis, thus we examined the distances in those genomes. For each gene, we calculated the distance from the first nucleotide of the SD sequence until the first nucleotide that is translated (i.e., 'A' of the AUG codon).

Our work assumptions are:

  1. The SD sequence is the canonical SD sequence - ‘AGGAGG’.
  2. The subsequence in the genome with the most negative energy value is the sequence which hybridizes to the anti SD in the most stable way, hence it is the SD sequence.

We calculated the distance according to the algorithm described below:

  1. For each gene we took a sequence of 30 bases before the AUG until 1 base after.
  2. Using an energy map, for every six-base subsequence within the sequence, we checked the bond energy to the canonical anti-SD (aSD) sequence (i.e., ‘UCCUCC’ for the canonical SD ‘AGGAGG’).
  3. We predicted that the subsequence with the most negative energy value has the strongest bond to the aSD, hence it is probably the SD sequence.
  4. We calculated the distance from the first base of the SD to the first base that is translated.

*The energy map is a table that contains the bond energy of the canonical aSD sequence to every possible six-base RNA sequence. The table was made by Shir Bahiri from Tuller’s lab.
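The steps above can be sketched in a few lines. The real energy map is not reproduced here, so the sketch substitutes a placeholder scoring function (`toy_energy`, one unit of negative energy per base that matches the canonical SD); this placeholder, the function names, and the choice to pass only the bases upstream of the start codon are assumptions made for illustration.

```python
CANONICAL_SD = "AGGAGG"


def toy_energy(hexamer):
    """Placeholder for the real energy map: -1 per base matching the canonical SD."""
    return -sum(a == b for a, b in zip(hexamer, CANONICAL_SD))


def sd_distance(upstream, energy=toy_energy, k=6):
    """Locate the putative SD in the bases immediately upstream of the start codon.

    Returns (distance, hexamer, energy), where distance is counted from the
    first base of the SD to the first translated base ('A' of the start codon).
    """
    upstream = upstream.upper().replace("U", "T")
    if len(upstream) < k:
        raise ValueError("upstream region is shorter than the SD length")
    best = None
    for i in range(len(upstream) - k + 1):
        hexamer = upstream[i:i + k]
        e = energy(hexamer)
        # Keep the most negative energy; the first window wins on ties.
        if best is None or e < best[2]:
            best = (len(upstream) - i, hexamer, e)
    return best


# Toy 30-base upstream region (not a real E. coli sequence): the hexamer
# ACGAGG starts 8 bases before the start codon.
upstream = "CTT" * 7 + "C" + "ACGAGG" + "TT"
print(sd_distance(upstream))
```

Swapping `toy_energy` for a lookup into the actual energy table would keep the rest of the procedure unchanged.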


Figure 4: An example sequence from E. coli. ATG is the start codon, ACGAGG is the subsequence with the most negative bond energy to the aSD within this 30-base-long sequence, and the red arrow marks the SD distance (8 in this case).

We can see the distributions of the distances for each species below:


Figure 5: B. subtilis: histogram of SD distance from the translation start site


Figure 6: E. coli: histogram of SD distance from the translation start site

We found that the average distance is 8.253±4.559 for E. coli and 8.515±3.346 for B. subtilis.

However, although a t-test and a Mann-Whitney U test both indicate a statistically significant difference between the distributions of the distances (p-values of 2.984e-03 and 4.164e-61, respectively), the distributions are not different enough to conclude that either species strongly prefers a specific distance between the SD sequence and the start codon. Hence, we decided not to use this feature in our next software version.
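One way to see why a significant p-value does not imply a useful difference here is to compute a standardized effect size from the reported means and standard deviations. The sketch below uses Cohen's d with a pooled standard deviation; it assumes equal group sizes because the number of genes per species is not reported on this page, so the value should be read as an approximation.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using a pooled standard deviation (assumes equal group sizes)."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_sd


# Reported values: E. coli 8.253 +/- 4.559, B. subtilis 8.515 +/- 3.346.
d = cohens_d(8.253, 4.559, 8.515, 3.346)
print(f"Cohen's d = {d:.3f}")
```

Under these assumptions d is roughly 0.07, well below the 0.2 that is conventionally described as a small effect. With thousands of genes per genome, even such a small shift can yield very low p-values.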

References

  1. J. Starmer, A. Stomp, M. Vouk, and D. Bitzer, “Predicting shine–dalgarno sequence locations exposes genome annotation errors,” PLoS Computational Biology, vol. 2, no. 5, 2006. 
  2. S. Bahiri Elitzur, R. Cohen-Kupiec, D. Yacobi, L. Fine, B. Apt, A. Diament, and T. Tuller, “Prokaryotic rRNA-mRNA interactions are involved in all translation steps and shape bacterial transcripts,” 2020.
  3. S. Bahiri Elitzur, R. Cohen-Kupiec, D. Yacobi, L. Fine, B. Apt, A. Diament, and T. Tuller, “Prokaryotic rRNA-mRNA interactions are involved in all translation steps and shape bacterial transcripts,” RNA Biology, pp. 1–15, 2021. 


CRISPR System Utilization

CRISPR (clustered regularly interspaced short palindromic repeats) is a family of genetic sequences involved in the bacterial/archaeal acquired immune system.

By sequencing the CRISPR systems of the different bacteria in the microbial community, we hope to identify crRNAs (CRISPR RNAs) that are present only in the deoptimized organisms. Regions complementary to these crRNAs will be inserted into the designed plasmid, along with the corresponding PAM sequence in the correct position (similarly to the restriction sites), to promote selective cleavage and digestion of the plasmid in the deoptimized organisms.
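The selection logic can be sketched as a set operation followed by assembly of the target site. Everything specific in this sketch is an assumption: spacers are given as DNA strings per organism, the organism names are invented, and the PAM sequence and the side on which it is placed are parameters because they depend on the CRISPR system of each host (the default mimics a 3' NGG-type PAM purely as an example).

```python
def deoptimized_only_spacers(deopt_spacers, opt_spacers):
    """For each deoptimized organism, keep spacers absent from all optimized organisms.

    Both arguments map an organism name to an iterable of spacer sequences.
    """
    protected = set().union(*(set(s) for s in opt_spacers.values()))
    return {
        organism: sorted(set(spacers) - protected)
        for organism, spacers in deopt_spacers.items()
    }


def target_site(spacer, pam="TGG", pam_side="3prime"):
    """Protospacer plus PAM, to be inserted into the plasmid design."""
    if pam_side == "3prime":
        return spacer + pam
    if pam_side == "5prime":
        return pam + spacer
    raise ValueError("pam_side must be '3prime' or '5prime'")


# Hypothetical spacer sets for one deoptimized and one optimized organism.
deopt = {"org_D1": ["ACGTACGTACGTACGTACGT", "TTTTGGGGCCCCAAAATTTT"]}
opt = {"org_O1": ["TTTTGGGGCCCCAAAATTTT"]}

selected = deoptimized_only_spacers(deopt, opt)
sites = [target_site(s) for s in selected["org_D1"]]
print(selected, sites)
```

A practical implementation would also need to tolerate near-matches, since a spacer in an optimized organism that differs by a base or two might still direct cleavage.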


Figure 7: CRISPR flowchart



Plasmid Family Approach

When we talked with Prof. Elhanan Borenstein about our optimization strategy, he raised a concern about how optimal a one-plasmid solution could really be. If, for example, the organisms in the optimization group are distant from one another in terms of codon usage bias, one plasmid that is optimized for all of them jointly would not be optimal for any of them. Instead, he suggested a multiple-plasmid solution, in which each plasmid is optimized for only part of the optimization group and deoptimized for the entire deoptimization group. This solution might lead to more optimal, more specific gene expression than the one-plasmid solution.

In order to implement this approach, we intend to use a metric that can divide the optimization organisms into several groups. One way to define such a metric for translation optimization is to calculate the distance between the codon usage bias profiles of different organisms. The metric will therefore reflect the distance between two organisms in terms of the evolutionary adaptation of mRNA translation. The smaller the distance, the more likely it is that the two organisms fall into the same group and that a single plasmid can be optimized for both.
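A minimal sketch of such a metric is shown below. It represents each organism by the relative frequency of the 64 codons across its coding sequences and uses the Euclidean distance between two profiles. Both choices are assumptions for illustration: raw codon frequencies also reflect amino acid composition, so a per-amino-acid normalization may be preferable, and the final metric has not been chosen.

```python
from collections import Counter
from itertools import product
from math import sqrt

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_profile(coding_seqs):
    """Relative frequency of each of the 64 codons across in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        # Read complete codons only; a trailing partial codon is ignored.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        raise ValueError("no valid codons found")
    return [counts[c] / total for c in CODONS]


def profile_distance(p, q):
    """Euclidean distance between two codon usage profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy genes encoding the same peptide with different alanine codons.
org_a = codon_profile(["ATGGCTGCTGCTTAA"])
org_b = codon_profile(["ATGGCCGCCGCCTAA"])
org_c = codon_profile(["ATGGCTGCTGCCTAA"])
print(profile_distance(org_a, org_b), profile_distance(org_a, org_c))
```

In this toy example the third organism, which shares most alanine codons with the first, is closer to it than the second organism is.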

However, the optimization cannot result in arbitrarily many plasmids, because of the restrictions of transformation in the lab. These physical restrictions tend to be flexible and can later be set by the user. We intend to integrate this method of optimization into the next version of our software tool.
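The user-defined limit on the number of plasmids could be enforced by merging the closest groups until the limit is met. The sketch below uses average-linkage agglomerative merging over a precomputed table of pairwise distances; the algorithm, the function name, and the example distances are assumptions for illustration, not the method that will necessarily be implemented.

```python
from itertools import combinations


def group_organisms(names, dist, max_groups):
    """Merge the closest groups (average linkage) until at most max_groups remain.

    dist maps frozenset({a, b}) to the distance between organisms a and b.
    """
    if max_groups < 1:
        raise ValueError("max_groups must be at least 1")
    groups = [[name] for name in names]

    def linkage(g1, g2):
        # Mean pairwise distance between members of the two groups.
        total = sum(dist[frozenset((a, b))] for a in g1 for b in g2)
        return total / (len(g1) * len(g2))

    while len(groups) > max_groups:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda pair: linkage(groups[pair[0]], groups[pair[1]]),
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]  # i < j, so index i is unaffected by the deletion
    return groups


# Hypothetical distances: A and B are close, C and D are close.
names = ["A", "B", "C", "D"]
dist = {frozenset(pair): 0.9 for pair in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 0.1
dist[frozenset(("C", "D"))] = 0.2

print(group_organisms(names, dist, max_groups=2))
```

Each resulting group would then receive its own plasmid, optimized for the members of that group and deoptimized for the whole deoptimization group.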