Team:Marburg/Software

Software - ChloDB

Abstract

We have created a comprehensive chloroplast database that can provide deep insights for the development and engineering of the plants of tomorrow. To this end, we have compiled 6,065 chloroplast genomes into an easy-to-use, freely available database containing well over 2.4 million sequence annotations.

In addition, considerable effort has been invested in reannotating and standardizing the data. In doing so, over 5,000 coding sequences, 77,000 tRNAs, and numerous rRNAs have been annotated to follow a unified naming convention.

The resulting database was successfully used both in the design process of our toolbox and in the creation of our best basic part. To get more insight into how we have integrated the database into bioinformatics workflows, please visit our model page.

**Figure 1: ChloDB in Mongo Compass**
Exemplary genome entry in our chloroplast database, visualized with the help of Mongo Compass

Introduction

Chloroplasts originate from a single endosymbiotic uptake of a cyanobacterium by a heterotrophic protist approximately 1.7 billion years ago. Subsequently, host cell and endosymbiont adapted to each other in several ways, including the reduction of the endosymbiont's genome, gene transfer to the host nuclear genome, the development of a protein import machinery, and metabolite exchange via membrane-integrated transporters.

This evolution of the endosymbiont led to a drastic genome reduction. Consequently, today's chloroplast genomes contain 100 to 250 genes and typically range in size between 120 kb and 160 kb. Despite this evolution, chloroplast genomes from various species reveal a surprisingly conserved organization in terms of size, structure, and gene content [1].

Insights gained from analyzing these genome sequences have already greatly enhanced our understanding of plant biology. For instance, chloroplast genomes have been used in phylogenetic studies of several plant families and helped resolve evolutionary relationships within phylogenetic clades. With advances in sequencing techniques and the launch of various sequencing projects, a rapidly increasing number of chloroplast genomes is available.

Although this represents a unique opportunity to gain profound insights into one of nature's most important machineries, few attempts have been made to aggregate this data into a comprehensive database. After an extensive search, we came across the following projects:

GOBASE (Organelle Genome Database) [2]
ChloroplastDB [3]
Database of PCR Primers of Chloroplast Genomes [4]
PCIR (Database of Plant Chloroplast Inverted Repeats) [5]
cpGDB [6]

Unfortunately, many of these projects are either defunct or limited to very specific applications. Given the high potential and the lack of alternatives, we believe that such a database could add significant value for both iGEM teams and the scientific community at large. With our project in mind, we immediately realized that such a database could play a central role for OpenPlast.

We believe that since the endosymbiotic event 1.7 billion years ago, these organelles have traversed large portions of the evolutionary sequence space and could therefore shed light on the space of possible regulatory sequences. We were therefore keen to explore how this data could be utilized to inform the design process of our toolbox. Going beyond our project, we envision that such information may also prove particularly valuable to our understanding of climate adaptation of economically important plants.

Considerations

In order to ensure that such a database is of use to iGEM teams and other stakeholders, many considerations had to be taken into account. We, therefore, set out to once again analyze the limitations of previous implementations.

Here, we quickly noticed that many of the tools listed are solely available as a website, and do not provide the underlying data for download. While this approach certainly offers a user-friendly interface, potentially opening it up to a wider audience, it significantly hinders most common bioinformatics tasks and pipelines. Additionally, the lack of locally available data is especially troublesome for projects which rely on such tools, seeing as these sites can be discontinued without prior warning.

Hence, it was important to us that the database we create is freely available for download.
With this decision in mind, we set out to search for a solution that could support major bioinformatics workflows while remaining flexible enough to support novel approaches. Additionally, it should scale with the ever-increasing number of available chloroplast genomes and offer high performance for large data queries.

One natural fit for such requirements comes in the form of NoSQL databases.
Compared to classical relational database management systems (RDBMS), NoSQL databases do not need a fixed schema, avoid joins, and scale well. Removing the need for rigid schemas gives us greater flexibility to handle the inherently complex and hierarchical data of chloroplast genomes. Introduced in 1998, this approach has seen widespread adoption by industry giants and startups alike. Given an estimated market worth of $22 billion by 2026 and an ever-growing number of database management systems and tools, we searched for a solution that would satisfy our requirements.

In the selection process, we placed high emphasis on solutions that:

can be used from all major programming languages, including Python, Java, and C
are open-source
are well maintained
are future-proof
are free to use
are easy to use

Implementation

One of the solutions we came across was MongoDB, a schema-free, document-oriented database that uses collection-oriented storage. Released in 2009, MongoDB quickly rose in popularity and now secures roughly a quarter of the market share of all database systems, making it by far the most popular NoSQL solution.

In order to further facilitate the broad application of our project, we have chosen to use Python, given that it is both increasingly widespread and relatively easy to learn.

Once we had decided on the tools, we set out to put our ambitious goals into practice and started the hunt for chloroplast genomes. In order to obtain as many chloroplast genomes as possible, we conducted thorough literature research as well as an exhaustive analysis of thousands of NCBI-Entrez database entries.

The resulting dataset of nearly 8,000 possible chloroplast genomes was closely examined and filtered for duplicates and false-positive entries. After processing the data, we were left with 6,065 unique chloroplast genomes, which now had to be embedded in the database.
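
The filtering step can be pictured with a small sketch. The exact criteria used for ChloDB are not described here, so everything below is an assumption made for illustration: duplicates are detected by hashing the sequence, false positives by a simple keyword check on the record description, and the dictionary keys and function names are hypothetical.

```python
import hashlib


def sequence_digest(seq):
    """Case-insensitive checksum of a nucleotide sequence."""
    return hashlib.sha256(seq.upper().encode("ascii")).hexdigest()


def filter_candidates(records):
    """Keep one record per unique sequence and drop obvious false positives.

    `records` is an iterable of dicts with the hypothetical keys
    'accession', 'definition' and 'sequence'.
    """
    seen = set()
    kept = []
    for rec in records:
        text = rec["definition"].lower()
        # Assumed false-positive filter: the description must mention the organelle.
        if "chloroplast" not in text and "plastid" not in text:
            continue
        digest = sequence_digest(rec["sequence"])
        if digest in seen:
            continue  # identical sequence already kept under another accession
        seen.add(digest)
        kept.append(rec)
    return kept


candidates = [
    {"accession": "A1", "definition": "Species one chloroplast, complete genome", "sequence": "ATGC"},
    {"accession": "A2", "definition": "Species one chloroplast, complete genome", "sequence": "atgc"},
    {"accession": "A3", "definition": "Species two mitochondrion, complete genome", "sequence": "GGCC"},
]
unique = filter_candidates(candidates)
print([rec["accession"] for rec in unique])
```

Hashing keeps memory use low because only a fixed-size digest per genome has to be remembered, not the full sequence.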

For this, we have chosen a schema that closely resembles that of the GenBank format.
Using such a standardized and well-established format has two major advantages over a novel approach:

Already established bioinformatics pipelines that utilize such formats can easily be adapted to the database, making it much easier to work with.
Downstream processing and data sharing are made possible, as data can both be imported as well as exported in the desired format.
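
To make the idea of a GenBank-like document more concrete, here is a minimal sketch using plain Python dictionaries. The field names (`accession`, `features`, `qualifiers`, and so on) and the helper functions are assumptions chosen for illustration and are not the actual ChloDB schema. The sketch shows how a nested feature can be rendered back into GenBank feature-table lines.

```python
def make_feature(ftype, start, end, strand, **qualifiers):
    """Build one feature sub-document (1-based inclusive coordinates, as in GenBank)."""
    return {"type": ftype, "start": start, "end": end,
            "strand": strand, "qualifiers": qualifiers}


def feature_to_genbank(feature):
    """Render a feature sub-document as GenBank feature-table lines."""
    location = f"{feature['start']}..{feature['end']}"
    if feature["strand"] == -1:
        location = f"complement({location})"
    # Feature key starts in column 6, the location in column 22.
    lines = [" " * 5 + feature["type"].ljust(16) + location]
    for key, value in feature["qualifiers"].items():
        lines.append(" " * 21 + f'/{key}="{value}"')
    return lines


# Hypothetical genome document: one top-level entry with embedded features.
genome_doc = {
    "accession": "EXAMPLE_0001",
    "organism": "Example species",
    "sequence": "ATGCATGCATGC",
    "features": [make_feature("gene", 1, 9, -1, gene="psbA")],
}

lines = feature_to_genbank(genome_doc["features"][0])
print("\n".join(lines))
```

Because the document is an ordinary nested dictionary, it could be stored as-is in a MongoDB collection, for example with `insert_one` from `pymongo`, given a running database instance.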

**Figure 2: ChloDB Schema**
Overview of the different collections in our database. Field types were taken from MongoEngine.

Data Cleaning and Curation

The goal is to turn data into information and information into insight.

The mere accumulation of data does not always translate into insight.
This was particularly evident to us when we first looked into the datasets and performed the initial data analysis.
In doing so, we found that, despite the efforts of many scientists to use the existing naming conventions, a significant number of annotations greatly deviated from these standards.

Such a discrepancy in sequence annotations severely affects the overall quality of the dataset and makes it considerably more difficult to fully utilize.
Consequently, the curation and cleaning of our data became our top priority.

According to naming conventions, chloroplast-encoded genes are named using three lowercase letters defining the protein or protein complex followed by an uppercase letter defining the subunit encoded by the specific gene (e.g. "psbA").
Although many discrepancies were caused by incorrect capitalization, we faced a multitude of additional problems that couldn’t be addressed so easily.
One such example is a gene that was initially named psbG.
Believed to be part of the PS-II complex, it was later revised to be a component of a chloroplast-located NADPH/quinone oxidoreductase and is now known as ndhK.
Adding to the confusion, a transmembrane helix protein that now bears the name psbG was later found in red algae.
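
The capitalization part of the convention lends itself to a mechanical fix. The following is a minimal sketch, assuming that only names made of exactly four letters are touched. The function name is hypothetical and not taken from the ChloDB code base. Reassigned names such as psbG cannot be resolved this way and still need sequence-based evidence.

```python
import re

# Three letters for the protein or complex, one letter for the subunit.
_GENE_PATTERN = re.compile(r"^([A-Za-z]{3})([A-Za-z])$")


def fix_gene_capitalization(name):
    """Return the name as three lowercase letters plus an uppercase subunit letter.

    Names that do not fit the four-letter pattern (e.g. 'rps12') are
    returned unchanged rather than guessed at.
    """
    match = _GENE_PATTERN.match(name.strip())
    if match is None:
        return name
    prefix, subunit = match.groups()
    return prefix.lower() + subunit.upper()


for raw in ["PSBA", "psba", "PsbA", "rbcl", "rps12"]:
    print(raw, "->", fix_gene_capitalization(raw))
```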

To address these and similar issues, we made extensive use of BLAST [7] to re-annotate both wrong and previously unnamed coding sequences.
The greatest difficulties, however, occurred where no generally accepted convention existed, namely in the case of non-protein-coding tRNA and rRNA.
In the case of tRNAs, for instance, we found over 3,000 different spellings.

For example, the tRNA trnL-UAG appeared under 23 different names, including: trnL-TAG, trnL(UAG), trnL(uag), trnL (UAG), trnL_UAG, trnL(tag), tRNA-Leu (UAG), trnL_uag, tRNA-Leu(UAG), trnL(TAG), tRNA-UAG, trnI-UAG, tRNA-Leu(TAG), trnLUAG, tRNA-Leu(tag), trnL-CAA; trnL-UAG, tRNA-Leu (TAG), Leu_UAG, trnL-uag, trnL-UAG; trnL-CAA, trnL(uag)'.

To unify the naming of those genes, we used the convention proposed by the Max Planck Institute of Molecular Plant Physiology:

trn + one-letter amino acid code + "-" + anticodon, e.g. trnL-UAG

Using existing annotations, we were able to relabel 45,543 tRNAs.
Transfer-RNAs whose annotations were either lacking the necessary information or contained contradictory data posed a particular challenge to us.
To address this issue, we utilized ARAGORN for the de novo annotation of 31,602 sequences [8].
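
Relabeling from existing annotations could look roughly like the sketch below. It is an illustration under stated assumptions and not the script used for ChloDB: the regular expression only covers spelling patterns like those listed above, the function name is hypothetical, and the sketch does not check whether amino acid and anticodon are biologically consistent. Names that cannot be parsed return `None`, which corresponds to the cases that needed de novo annotation.

```python
import re

THREE_TO_ONE = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
}
ONE_LETTER = set(THREE_TO_ONE.values())

# Either 'trn' + one-letter code, or an optional 'tRNA' prefix + three-letter code,
# followed by optional separators and a three-base anticodon.
_TRNA = re.compile(
    r"^(?:trn(?P<one>[A-Za-z])|(?:tRNA[-_ ]?)?(?P<three>[A-Za-z]{3}))"
    r"[-_ (]*(?P<anticodon>[ACGTUacgtu]{3})\)?$"
)


def normalize_trna_name(raw):
    """Return the name as 'trnX-NNN' (RNA alphabet), or None if it cannot be parsed."""
    match = _TRNA.match(raw.strip())
    if match is None:
        return None
    if match.group("one"):
        amino_acid = match.group("one").upper()
        if amino_acid not in ONE_LETTER:
            return None
    else:
        amino_acid = THREE_TO_ONE.get(match.group("three").lower())
        if amino_acid is None:
            return None
    anticodon = match.group("anticodon").upper().replace("T", "U")
    return f"trn{amino_acid}-{anticodon}"


variants = ["trnL-TAG", "trnL(UAG)", "trnL (UAG)", "trnL_uag", "trnLUAG",
            "tRNA-Leu (UAG)", "tRNA-Leu(tag)", "Leu_UAG"]
normalized = {normalize_trna_name(v) for v in variants}
print(normalized)
print(normalize_trna_name("tRNA-UAG"))             # amino acid missing
print(normalize_trna_name("trnL-CAA; trnL-UAG"))   # contradictory annotation
```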

Similarly, all rRNAs have been renamed to use the format:

rrn + size in Svedberg units without the "S", e.g. rrn16
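
A minimal sketch of this renaming, assuming that the first number found in the original annotation is the size in Svedberg units. The function name and the example spellings are hypothetical.

```python
import re

# Integer or decimal size, e.g. '16' or '4.5'.
_SIZE = re.compile(r"\d+(?:\.\d+)?")


def normalize_rrna_name(raw):
    """Return 'rrn' + size without the trailing 'S', or None if no size is present."""
    match = _SIZE.search(raw)
    if match is None:
        return None
    return f"rrn{match.group(0)}"


for name in ["16S ribosomal RNA", "rrn16S", "23S rRNA", "rrn4.5", "ribosomal RNA"]:
    print(name, "->", normalize_rrna_name(name))
```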

Lastly, we took a look at non-gene annotations, focusing on inverted repeats (IR), which are known to be present in most chloroplast genomes.
To identify all missing IRs, we self-aligned each genome via BLAST and set the cutoff to 500 bp.
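
The principle behind this search can be shown without BLAST. The sketch below swaps in a much simpler technique, an exact-match seed-and-extend search against the reverse complement, so unlike BLAST it tolerates no mismatches or gaps. The function names and the `seed` parameter are assumptions, and `min_length` plays the role of the 500 bp cutoff.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeat(seq, min_length=500, seed=16):
    """Return (start_a, start_b, length) of the longest exact inverted repeat.

    Coordinates are 0-based on the forward strand. The copy at start_b is the
    reverse complement of the copy at start_a. Returns None if no
    non-overlapping pair of at least min_length bases is found.
    """
    seq = seq.upper()
    n = len(seq)
    rc = reverse_complement(seq)
    # Index every seed-length word of the forward strand.
    index = {}
    for i in range(n - seed + 1):
        index.setdefault(seq[i:i + seed], []).append(i)
    best = None
    for p in range(n - seed + 1):
        for i in index.get(rc[p:p + seed], ()):
            # Only extend from the left end of a maximal match.
            if i > 0 and p > 0 and seq[i - 1] == rc[p - 1]:
                continue
            length = seed
            while (i + length < n and p + length < n
                   and seq[i + length] == rc[p + length]):
                length += 1
            start_b = n - p - length  # map back to forward coordinates
            if length < min_length or i + length > start_b:
                continue  # too short, overlapping, or the mirrored duplicate
            if best is None or length > best[2]:
                best = (i, start_b, length)
    return best


# Toy genome: single-copy region, repeat, spacer, inverted copy, tail.
ir = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
genome = "T" * 10 + ir + "G" * 10 + reverse_complement(ir) + "C" * 5
hit = find_inverted_repeat(genome, min_length=20, seed=8)
print(hit)
```

For real chloroplast genomes, an alignment tool that handles mismatches, such as BLAST, remains the better choice. The sketch is only meant to show what a self-alignment with a minimum length looks for.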

**Figure 3: Statistics of the final database**

Outlook

As the number of available chloroplast genomes continues to grow, we plan to adapt and expand our database well after the iGEM competition. Besides adding new data, we also aim to further improve its quality.

To this end, we want to perform a detailed investigation of hypothetical proteins and open reading frames (ORFs). Since the current naming convention for ORFs only encodes the number of amino acids, valuable information about their interrelationships and possible functions remains untapped.

To address this, we want to use both the database and our model to apply multiple sequence alignment and clustering algorithms and thereby gain deeper insight into these genes.

Sources

[1] Fu, J., Liu, H., Hu, J., Liang, Y., Liang, J., Wuyun, T., & Tan, X. (2016). Five Complete Chloroplast Genome Sequences from Diospyros: Genome Organization and Comparative Analysis. In B. Heinze (Ed.), PLOS ONE (Vol. 11, Issue 7, p. e0159566). Public Library of Science (PLoS). https://doi.org/10.1371/journal.pone.0159566
[2] O’Brien, E. A., Zhang, Y., Wang, E., Marie, V., Badejoko, W., Lang, B. F., & Burger, G. (2009). GOBASE: an organelle genome database. In Nucleic Acids Research (Vol. 37, Issue Database, pp. D946–D950). Oxford University Press (OUP). https://doi.org/10.1093/nar/gkn819
[3] Cui, L. (2006). ChloroplastDB: the Chloroplast Genome Database. In Nucleic Acids Research (Vol. 34, Issue 90001, pp. D692–D696). Oxford University Press (OUP). https://doi.org/10.1093/nar/gkj055
[4] Heinze, B. (2007). A database of PCR primers for the chloroplast genomes of higher plants. In Plant Methods (Vol. 3, Issue 1, p. 4). Springer Science and Business Media LLC. https://doi.org/10.1186/1746-4811-3-4
[5] Zhang, R., Ge, F., Li, H., Chen, Y., Zhao, Y., Gao, Y., Liu, Z., & Yang, L. (2019). PCIR: a database of Plant Chloroplast Inverted Repeats. In Database (Vol. 2019). Oxford University Press (OUP). https://doi.org/10.1093/database/baz127
[6] Singh, B. P. (2020). CpGDB : A Comprehensive Database of Chloroplast Genomes. In Bioinformation (Vol. 16, Issue 2, pp. 171–175). Biomedical Informatics. https://doi.org/10.6026/97320630016171
[7] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. In Journal of Molecular Biology (Vol. 215, Issue 3, pp. 403–410). Elsevier BV. https://doi.org/10.1016/s0022-2836(05)80360-2
[8] Laslett, D. (2004). ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. In Nucleic Acids Research (Vol. 32, Issue 1, pp. 11–16). Oxford University Press (OUP). https://doi.org/10.1093/nar/gkh152