Software
Introduction
Our software aims to tailor the expression of genetic information to any microbial population, producing a regulated gene ready for transformation along with relevant documentation and predictions, presented in a user-friendly manner. Computational and experimental validations have received results beyond our expectations. Using the Enableaccessibility add-on, appropriate graphic features were added to ensure full accessibility to all users, while implementing the pipeline intuitively using a stepwise interactive framework.
Corresponding to the flexibility of metagenomic data, our software applies to communities with various degrees of characterization; from mere annotated genomes to documented gene expression levels, along with optional advanced features which are fully customizable.
Our software is available to all users, in all operating systems, by using our independently developed docker image!
Just install the following .zip file - Communique_install.zipand follow our user guide and video below!
Figure 1: Software Description Diagram
Description
Inputs
In order to apply our software, there are several basic inputs the user must insert:
Microbial composition:
First, the user must insert the data about the microbial composition which will be used for the optimization process:
- The number of microorganisms in the microbial community, and specifically which ones are to be optimized and which to be deoptimized
- A GenBank file containing the whole genome (with annotations) for each one of the microorganisms
- The tuning parameter, which controls tradeoff between optimization for optimized organisms and deoptimized organisms
- Advanced option: expression data - in order for users to utilize the full potential of the characterization of the microbiome they are engineering, a .csv file of expression levels can be inserted and used for extraction of the highly expressed genes for the model. Any measurements can be used if they are inserted in the correct format:
- The gene names should be the same as the NCBI gene names of the organism
- Column with the gene names is labeled "gene"
- Column with the gene expression levels is labeled "mRNA_level"
Optimization preferences:
After characterizing the microbiome, the user must decide which optimization modules to apply:
- Coding sequence optimization: the user must decide if they would like to apply the following optimization modules:
- - Restriction enzyme optimization
- - Translation efficiency optimization
- Advanced option: the user can decide which optimization function is used. The default choice is the average hill climbing function.
- Transcription optimization: the user must decide if they would like to receive a microbiome-specific promoter to regulate the transcription of their gene of interest.
- * Advanced option: the user can decide to optimize promoters from an external list (also suitable for the case of synthetic ones) instead of the default optimization which is performed on the organism’s endogenous promoters. In that case, a FASTA file containing the sequences of the users promoters should be uploaded.
Optimization flow
Data processing: for all organisms, the following properties are extracted for all organisms:
- Codon usage bias information
- Promoters and intergenic regions
There are two main units in our software: the coding sequence unit and the transcription unit. These units are independent and work separately.
- Coding sequence unit: first the original sequence is inputted to the translation model, and an optimization is performed based on codon usage bias indices. Then, this sequence is passed to the restriction enzyme model and the restriction sites are optimized, so the final sequence is obtained. For this final optimized sequence, optimization indices are calculated.
- Transcription unit: a promoter list (whether an external list by the user or internal list) is inputted to the transcription optimization model along with endogenous promoters and intergenic sequences and the calculations of the algorithm are performed. At the end, a different statistical analysis of the results is performed.
Outputs
The following results will be printed on the users screen:
- Coding sequence optimization: the optimized DNA sequence of the desired gene and two optimization statistics:
- Weakest link score
- Optimization index score
- The location of a FASTA file including the re-coded sequence
- Transcription optimization: The most selective promoter (including its description) is returned along with:
- A synthetic version of it, which is likely to be even more selective
- The E-value of the promoter
- The location of a FASTA file containing the two promoters
- Path to the results.zip file: the results zip contains the produced FASTA files along with log files from all different models, which contain all appropriate information regarding the analysis of the genomic composition and data from the microbiome and additional details about the optimization, i.e. characterization of the restriction enzymes which recognize the new sequence, the final codon usage bias measurement used for the calculation, additional statistical calculations for all optimization steps.
Software Analysis
In the conversation with Prof. Elhanan Borenstein, he highlighted the necessity of appropriate analysis of the degree of separation in expression that our software is able to achieve and recommended us to do so by examining performance for different sizes of communities and their evolutionary closeness.
Use Case: selectively optimizing phage resistance for soil microbiome
Selected microbiome
The microbiome we chose to analyze was recommended by Dr. Nadav Kashtan - who pointed us towards an article characterizing the functional overlap of the Arabidopsis leaf and root microbiota [5]. We chose to work with the described soil microbiome due to to following reasons:
- From the beginning of software development, “soil-microbiome therapy”, along with other agro-tech initiatives, was recognized as one of the leading applications of our project. Thus, we were eager to examine our optimization capabilities in this environment.
- Working with leaf and root microbiomes, instead of the soil microbiome, increases the risk of MAMP (microbiome-associated molecular patterns) to initiate a defense response in plants [3].
Figure 2: Arabidopsis soil microbiome
The data from the article presents three taxonomic levels of the organisms in the soil microbiome - Phylum, Family, and Genus, and their 16S rRNA sequences for 34 bacterial species.
In order to find the correct genome for the detected species, we checked each 16S rRNA sequence in the BLAST rRNA software and found which organism has the best percent identity. We downloaded their genomes from NCBI in GenBank files, so that they would match our software’s required file. If the genome sequence of the organism with the highest percent identity didn’t exist, we took the organism with the highest percent identity after it. We didn’t take organisms with percent identity lower than 98.5%.
The data used for this analysis can be found in our GitHub repository: TAU_Israel 2021
Selected gene
After consultation with Dr. Lianet Noda, we reviewed many possible options for genes. In the end, we decided to showcase our results for optimizing an anti-phage defense gene in our community, as it can be used in a wide array of sub-populations for various different purposes, showcasing the flexibility of this framework. Specifically, we worked with the NCBI record of the ZorA gene, which is part of the Zorya defense system that is inferred to be involved with membrane polarization and infected cell death [1].
Analysis process
First analysis: scale-up analysis
In order to be able to use our model for industrial or academic microbiome engineering, we should be able to optimize any gene of interest for various microbiomes with different sizes, degrees of complexity, and diversity.
Therefore, wanted to check the quality of our model depending on the size of the input organisms groups and their taxonomy level conservation. To do so, we decided to choose an evenly sized sub-microbiome of x organisms randomly, such that x belongs to [2,34], and split the organisms into two groups of similar sizes, so that each running different taxonomy level is conserved in the split group.
Sub-microbiome splitting method: For each organism in the sub-microbiome and taxonomic organization level, we grouped organisms from the same taxonomy together and sorted these groups according to their size. The first split group was defined as the odd indices in the sorted group’s list, and the second one was defined as the even indices in the sorted group’s list. In order to make sure that the size of the two split-groups was as equal as possible, we did 50 different splittings and took the splitting in which the group sizes were as close as possible, and then randomly assigned one to be the optimized group and one the deoptimized group.
Figure 3: Separation of optimized and deoptimized organisms
After defining the optimized and deoptimized group, we applied our protein optimization software (using neutral preference between optimization and deoptimization, meaning setting our tuning parameter as 0.5) and calculated the optimization score for all four optimization functions. This whole process was repeated ten times, and the average result plotted:
Figure 4: Optimization for different sub-microbiome sizes
- Figure 4.a: Average score optimization
- Figure 4.b: Weakest link score optimization
- Figure 4.c: Single codon global optimization
- Figure 4.d: Single codon local optimization
Overall, more variation is seen in the smaller microbiomes as they differ more from each other. As we expected, for every sub-microbiome size, the score is better when the genomes are more distant evolutionarily. In other words, the optimization score is the highest when phyla are conserved and the lowest when only the genera are conserved.
When comparing the two optimization strategies- hill-climbing (Fig. 4a and b) and single codon (Fig. 4c and d), the hill-climbing method is evidently better than the single codon view. This result is interesting on its own, and traditional codon usage bias optimization, also called codon harmonization, is performed in a single codon manner [2]. Moreover, analyzing the difference between the two hill-climbing methods- Average (Fig. 4a) and weakest link (Fig. 4b), it is evident that when only genera are conserved in large sub-microbiomes, it is very hard to achieve optimization of the weakest link, or in other words, completely separate the clusters of the optimized and deoptimized organisms. However, even in that scenario, optimization of the average score can be obtained.
Lastly, even in naturally occurring populations, a significant degree of optimization is observed in nearly all cases, demonstrating the feasibility of the optimization process in potential environments.
Second analysis: evolutionary correlation
After checking the performance of our models in potential environments, we were interested to know how much evolutionary distance between two organisms impacts the performance of our software.
In order to examine this question, we needed to define a continuous measure for the evolutionary distance between bacterial species.
Evolutionary distance measurement: The ribosomal genes are highly conserved between different species [4], therefore alignment between them is a fit measurement for the evolutionary distance between bacteria. More specifically, in order to utilize the data we had more precisely, we started by performing multiple sequence alignment (MSA) for all 16S rRNA sequences published in the article and defined the distance as the number of different characters between the two sequences.
For the analysis itself, we performed the following steps for all pairs of species in the root microbiome:
- Calculated the evolutionary distance between the pair
- Randomly assigned one to be optimized and the other to be deoptimized
- Applied our software to the defined pair
Figure 5: Defined optimized and deoptimized groups
We performed this analysis for both optimization strategies (the difference between two functions using the same strategy is exhibited only when the defined microbiome contains more than 2 organisms) and plotted the results in the following graph:
Figure 6: Optimization score over increasing evolutionary distance
Firstly, as seen in the previous analysis, the hill-climbing approach almost always exceeds the single codon approach. The spearman correlation between the evolutionary distance and the optimization score is larger and more significant in the case of hill-climbing optimization (correlation = 0.7345, p-value = 2.009E-158) compared to the single codon optimization (correlation = 0.4304, p-value = 3.221E-43).
In other words, the hill-climbing optimization works even for very closely related organisms and becomes even better as the species become more evolutionarily distant, exhibiting the superiority of this novel method over the traditional one.
To conclude the discussion about the correlation between model performance and evolutionary distance, the following point must be taken into account:
If organisms are very closely related evolutionarily, our models might not be able to perform the optimization as selectively as needed. However, this is not as problematic as it may seem due to two main points:
- If two species are very close evolutionarily and functionally, they are likely to exploit the same environmental niche and thus will be less likely to co-exist together. [6]
- In cases where the first point is untrue, expression separation between those two species might not be prioritized due to their close functional similarity.
Thus, in general, this computational analysis has exceeded our initial scale-up goals for our software, and we are truly excited to test this optimization strategy in vitro and in-vivo as well.
References
[1] A. H. Azam and Y. Tanji, “Bacteriophage-host arm race: An update on the mechanism of phage resistance in bacteria and revenge of the phage with the perspective for phage therapy,” Applied Microbiology and Biotechnology, vol. 103, no. 5, pp. 2121–2131, 2019.
[2] E. Angov, C. J. Hillier, R. L. Kincaid, and J. A. Lyon, “Heterologous protein expression is enhanced by harmonizing the codon usage frequencies of the target gene with those of the expression host,” PLoS ONE, vol. 3, no. 5, 2008.
[3] M. Voges, "Molecular Principles to Engineer Plant Microbiomes." Order No. 28209498, Stanford University, Ann Arbor, 2019.
[4] N. Yutin, P. Puigbò, E. V. Koonin, and Y. I. Wolf, “Phylogenomics of Prokaryotic ribosomal proteins,” PLoS ONE, vol. 7, no. 5, 2012.
[5] Y. Bai, D. B. Müller, G. Srinivas, R. Garrido-Oter, E. Potthoff, M. Rott, N. Dombrowski, P. C. Münch, S. Spaepen, M. Remus-Emsermann, B. Hüttel, A. C. McHardy, J. A. Vorholt, and P. Schulze-Lefert, “Functional overlap of the Arabidopsis leaf and root microbiota,” Nature, vol. 528, no. 7582, pp. 364–369, 2015.
[6] N. Tromas, Z. E. Taranu, B. D. Martin, A. Willis, N. Fortin, C. W. Greer, and B. J. Shapiro, “Niche separation increases with genetic distance among bloom-forming cyanobacteria,” Frontiers in Microbiology, vol. 9, 2018.
Software Safety Scan
We understand the risks and biosafety hazards involved in engineering an environment directly (read more about this in Safety). Our conversation with Professor Dov Greenbaum, Director of the Zvi Meitar Institute for Legal Implications of Emerging Technologies, motivated us to integrate safety checks that might prevent some of the associated risks.
The core of our project targets one type of safety hazard: the threat of genetically engineered components being spread uncontrollably in the environment, altering populations, and possibly creating irrecoverable damage. However, safety concerns come in many different flavors: What if we create software that is harmful? What if someone uses our software for malicious purposes, or meddles with our code to create their own version, which might not be as well-reviewed and thought-out and could cause harm?
These are all questions we were presented with during our meetings with professionals in this field (you can read more about them on the Integrated HP page). It motivated us to delve into existing research and past iGEM projects that may have dealt with this issue, and sure enough, we found that the 2017 Heidelberg team, named "The Phage and the Furious",developed a software that is all about safety. Incorporating state-of-the-art artificial intelligence architecture, the SafetyNet tool predicts the level of danger that an input sequence imposes, based on the organism it's originated from and its cellular function. We were especially interested in DeeProtein [1] – the module that predicts sequence functionality and was finetuned after the competition. Integrating it in our software will allow us to create a kill-switch for it and simply not perform our optimizations if the gene is predicted to be dangerous.
To do so, we set up a meeting with Julius Upmeier zu Belzen, one of the contributors to DeeProtein. He was kind enough to answer our many questions regarding the software, provide us with additional resources, and even challenge us with his own inquiries about our project and the ways it tackles safety issues.
Unfortunately, due to our limited time and DeeProtein being currently out of order, we weren't able to successfully incorporate this safety check in our finished software. Nevertheless, it did give us inspiration – DeeProtein utilizes Gene Ontology (GO) annotation, which is a system that represents gene attributes and functions for genes in all species. In particular, DeeProtein uses a list of GO terms that are associated with any type of pathogenic activity and its neural network is trained to recognize them. Julius kindly provided us with this list of GO terms, and we decided to focus our search now on GO-predicting tools that we could integrate in our software in a timely manner.
After looking into tools such as NetGO 2.0 [2] and DeepGOPlus [3], we are currently working on incorporating the latter into the Communique software, and hoping to release an updated and safer version soon!
References
- J. Upmeier zu Belzen, T. Bürgel, S. Holderbach, F. Bubeck, L. Adam, C. Gandor, M. Klein, J. Mathony, P. Pfuderer, L. Platz, M. Przybilla, M. Schwendemann, D. Heid, M. D. Hoffmann, M. Jendrusch, C. Schmelas, M. Waldhauer, I. Lehmann, D. Niopek, and R. Eils, “Leveraging implicit knowledge in neural networks for functional dissection and engineering of proteins,” Nature Machine Intelligence, vol. 1, no. 5, pp. 225–235, 2019.
- S. Yao, R. You, S. Wang, Y. Xiong, X. Huang, and S. Zhu, “NetGO 2.0: Improving large-scale protein function prediction with massive sequence, text, domain, family and network information,” Nucleic Acids Research, vol. 49, no. W1, 2021.
- M. Kulmanov and R. Hoehndorf, “DeepGOPlus: Improved protein function prediction from sequence,” Bioinformatics, 2019.
Software User Manual
User Video