Software - ChloDB
Abstract
We have created a comprehensive chloroplast database that can provide deep insights for the development and engineering of the plants of tomorrow. To this end, we have compiled 6,065 chloroplast genomes into an easy-to-use, freely available database containing well over 2.4 million sequence annotations.
In addition, considerable effort has been invested in reannotating and standardizing the data. In doing so, over 5,000 coding sequences, 77,000 tRNAs, and numerous rRNAs have been annotated to follow a unified naming convention.
The resulting database was successfully used both in the design process of our toolbox and in the creation of our best basic part. To learn more about how we have integrated the database into bioinformatics workflows, please visit our model page.
Introduction
Chloroplasts originate from a single endosymbiotic uptake of a cyanobacterium by a heterotrophic protist approximately 1.7 billion years ago. Subsequently, several optimizations between host cell and endosymbiont took place, such as the reduction of the endosymbiont's genome and gene transfer to the host nuclear genome, the development of a protein import machinery, and metabolite exchange via membrane-integrated transporters.
This evolution of the endosymbiont led to a drastic genome reduction. Consequently, today's chloroplast genomes contain 100 to 250 genes and typically range in size between 120 kb and 160 kb. Despite this evolution, chloroplast genomes from various species reveal a surprisingly conserved organization in terms of size, structure, and gene content [1].
Insights gained from analyzing these genome sequences have already greatly enhanced our understanding of plant biology. For instance, chloroplast genomes have been used in phylogenetic studies of several plant families and helped resolve evolutionary relationships within phylogenetic clades. With advances in sequencing techniques and the launch of various sequencing projects, a rapidly increasing number of chloroplast genomes is becoming available.
Although this represents a unique opportunity to gain profound insights into one of nature's most important machineries, few attempts have been made to aggregate this data into a comprehensive database. After an extensive search, we came across the following projects:
- GOBASE [2]
- ChloroplastDB [3]
- A database of PCR primers for the chloroplast genomes of higher plants [4]
- PCIR [5]
- CpGDB [6]
Unfortunately, many of these projects are either defunct or limited to very specific applications. Given the
high potential and the lack of alternatives, we believe that such a database could add significant value for iGEM
teams and the scientific community at large. With our project in mind, we immediately realized that the
development of such a database could play a central role for OpenPlast.
We believe that since the endosymbiotic event 1.7 billion years ago, these organelles have traversed large portions of the evolutionary sequence space and could therefore shed light on the space of possible regulatory sequences. To this end, we were keen to explore how this data could be utilized to inform the design process of our toolbox. Going beyond our project, we envision that such information may also prove particularly valuable for understanding the climate adaptation of economically important plants.
Considerations
In order to ensure that such a database is of use to iGEM teams and other stakeholders, many considerations had to be taken into account. We, therefore, set out to once again analyze the limitations of previous implementations.
Here, we quickly noticed that many of the tools listed are solely available as a website, and do not provide the underlying data for download. While this approach certainly offers a user-friendly interface, potentially opening it up to a wider audience, it significantly hinders most common bioinformatics tasks and pipelines. Additionally, the lack of locally available data is especially troublesome for projects which rely on such tools, seeing as these sites can be discontinued without prior warning.
Hence, it was important to us that the database we create is freely available for download.
With this decision in mind, we set out to search for a solution that could support major bioinformatics workflows
while simultaneously being flexible enough to support novel approaches. Additionally, it should scale with the
ever-increasing number of available chloroplast genomes and offer high performance for large data queries.
One natural fit for such requirements comes in the form of NoSQL databases.
Compared to classical relational database management systems (RDBMS), NoSQL doesn’t need a fixed schema, avoids
joins, and offers great scalability. Removing the need for rigid schemas gives us greater flexibility to handle
the inherently complex and hierarchical data of chloroplast genomes. Introduced in 1998, this approach has seen
widespread adoption by industry giants and startups alike. Given an estimated market worth of $22 billion by
2026 and an ever-growing number of database management systems and tools, we searched for a solution that would
satisfy our requirements.
In the selection process, we placed high emphasis on solutions that are:
- usable from all major programming languages, including Python, Java, and C
- open-source
- well-maintained
- future-proof
- free to use
- easy to use
Implementation
One of the solutions we came across was MongoDB, a schema-free, document-oriented database that uses collection-oriented storage. Released in 2009, MongoDB quickly rose in popularity and now secures roughly a quarter of the market share of all database systems, making it by far the most popular NoSQL solution.
In order to further facilitate the broad application of our project, we have chosen to use Python, given that it is both increasingly widespread and relatively easy to learn.
Once we had decided on the tools, we set out to put our ambitious goals into practice and started the hunt for chloroplast genomes. In order to obtain as many chloroplast genomes as possible, we conducted thorough literature research as well as an exhaustive analysis of thousands of NCBI-Entrez database entries.
The resulting dataset of nearly 8,000 possible chloroplast genomes was closely examined and filtered for duplicates and false-positive entries. After processing the data, we were left with 6,065 unique chloroplast genomes, which now had to be embedded in the database.
For this, we have chosen a schema that closely resembles that of the GenBank format.
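How duplicates were detected is not described here. As a minimal sketch, assuming that duplicates are records with identical sequences, the filtering step could look as follows. The function names and the (identifier, sequence) tuple layout are illustrative assumptions, and rotated copies of the same circular genome would not be caught by this simple comparison.

```python
import hashlib


def sequence_key(sequence):
    """Hash a sequence after removing whitespace and unifying the case."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def drop_duplicates(records):
    """Keep the first record for every distinct sequence.

    `records` is an iterable of (identifier, sequence) tuples
    (a hypothetical layout chosen for this example).
    """
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence_key(sequence)
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique
```

Hashing keeps memory usage low because only a fixed-size digest is stored per genome rather than the full sequence.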
Using such a standardized and well-established format has two major advantages over a novel approach:
- Already established bioinformatics pipelines that utilize such formats can easily be adapted to the database, making it much easier to work with.
- Downstream processing and data sharing are simplified, as data can be both imported and exported in the desired format.
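To illustrate what a GenBank-like document could look like, here is a minimal sketch in plain Python. The field names, the 0-based half-open coordinates, and the accession "EXAMPLE_0001" are assumptions made for this example and are not the actual ChloDB schema.

```python
import json

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def make_genome_document(accession, organism, sequence, features):
    """Build one document per genome with its features nested inside."""
    return {
        "_id": accession,
        "organism": organism,
        "length": len(sequence),
        "sequence": sequence,
        "features": [
            {
                "type": ftype,
                "start": start,   # 0-based, inclusive (assumption)
                "end": end,       # 0-based, exclusive (assumption)
                "strand": strand,  # 1 or -1, as in GenBank-style tooling
                "qualifiers": qualifiers,
            }
            for ftype, start, end, strand, qualifiers in features
        ],
    }


def gene_sequences(doc, gene):
    """Return the sequence of every feature annotated with the given gene."""
    result = []
    for feature in doc["features"]:
        if feature["qualifiers"].get("gene") != gene:
            continue
        subseq = doc["sequence"][feature["start"]:feature["end"]]
        if feature["strand"] == -1:
            # Features on the reverse strand are reverse-complemented.
            subseq = subseq.translate(COMPLEMENT)[::-1]
        result.append(subseq)
    return result


doc = make_genome_document(
    accession="EXAMPLE_0001",  # hypothetical identifier
    organism="Example plant",
    sequence="AAACCCGGGTTT",
    features=[
        ("CDS", 3, 9, 1, {"gene": "psbA"}),
        ("tRNA", 0, 6, -1, {"gene": "trnL-UAG"}),
    ],
)
print(json.dumps(doc, indent=2))
```

With a driver such as PyMongo, a document of this shape could be stored with `collection.insert_one(doc)` and searched with a query like `collection.find({"features.qualifiers.gene": "psbA"})`.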
Data Cleaning and Curation
The goal is to turn data into information and information into insight.
The mere accumulation of data does not always translate into insight.
This was particularly evident to us when we first looked into the datasets and performed
the initial data analysis.
In doing so, we found that, despite the efforts of many scientists to use the existing naming conventions, a
significant number of annotations greatly deviated from these standards.
Such a discrepancy in sequence annotations severely affects the overall quality of the dataset and makes it
considerably more difficult to fully utilize.
Consequently, the curation and cleaning of our data became our top priority.
According to naming conventions, chloroplast-encoded genes are named using three lowercase letters defining the
protein or protein complex followed by an uppercase letter defining the subunit encoded by the specific gene (e.g.
"psbA").
Although many discrepancies were caused by incorrect capitalization, we faced a multitude of additional problems
that couldn’t be addressed so easily.
One such example is a gene that was initially named psbG.
Believed to be part of the PS-II complex, it was later revised to be a component of a chloroplast-located
NADPH/quinone oxidoreductase and is now known as ndhK.
Adding to the confusion, a transmembrane helix protein that now bears the name psbG was later found in red
algae.
To address these and similar issues, we made extensive use of BLAST [7] to re-annotate both incorrectly named and
previously unnamed coding sequences.
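The capitalization part of the gene naming convention described above can be sketched as a small helper. This is a simplified illustration, not the pipeline we used: the function name is hypothetical, and it assumes that a name consists of a three-letter prefix followed by a short alphanumeric suffix (as in psbA, rpl16, or rpoC1). Renamed genes such as psbG/ndhK cannot be fixed this way and require sequence-based re-annotation.

```python
import re

# Three letters for the protein or complex, then a subunit letter and/or number.
GENE_NAME = re.compile(r"([A-Za-z]{3})([A-Za-z0-9]+)")


def normalize_gene_name(name):
    """Return the name with a lowercase prefix and uppercase suffix.

    Returns None if the name does not fit the assumed pattern.
    """
    match = GENE_NAME.fullmatch(name.strip())
    if match is None:
        return None
    prefix, suffix = match.groups()
    return prefix.lower() + suffix.upper()


for raw in ("PSBA", "Psba", "rpoc1", "Rpl16"):
    print(raw, "->", normalize_gene_name(raw))
```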
The greatest difficulties, however, occurred where no generally accepted convention existed, namely in the case of
non-protein-coding tRNA and rRNA.
In the case of tRNAs, for instance, we found over 3,000 different spellings.
For example, the tRNA trnL-UAG appeared under 23 different names (trnL-TAG, trnL(UAG), trnL(uag), trnL (UAG), trnL_UAG, trnL(tag), tRNA-Leu (UAG), trnL_uag, tRNA-Leu(UAG), trnL(TAG), tRNA-UAG, trnI-UAG, tRNA-Leu(TAG), trnLUAG, tRNA-Leu(tag), trnL-CAA; trnL-UAG, tRNA-Leu (TAG), Leu_UAG, trnL-uag, trnL-UAG; trnL-CAA, trnL(uag)').
To unify the naming of those genes, we used the convention proposed by the Max Planck Institute of Molecular Plant Physiology:
trn<AMINO ACID ONE-LETTER CODE>-<ANTICODON>, e.g. trnL-UAG
Using existing annotations, we were able to relabel 45,543 tRNAs.
Transfer-RNAs whose annotations were either lacking the necessary information or contained contradictory data
posed a particular challenge to us.
To address this issue, we utilized ARAGORN for the de novo annotation of 31,602 sequences [8].
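Relabeling from existing annotations can be sketched with two regular expressions that cover many of the spellings listed above. This is a simplified illustration with hypothetical names, not our actual pipeline. It returns None for names that lack the amino acid or list several tRNAs at once, and it does not detect contradictory annotations such as trnI-UAG; those cases are the ones that call for de novo annotation.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
VALID_ONE_LETTER = set(THREE_TO_ONE.values())

# Matches e.g. trnL-TAG, trnL(uag), trnL (UAG), trnL_UAG, trnLUAG
ONE_LETTER = re.compile(r"trn([A-Za-z])[\s\-_(]*([ACGTUacgtu]{3})\)?")
# Matches e.g. tRNA-Leu(UAG), tRNA-Leu (TAG), Leu_UAG
THREE_LETTER = re.compile(
    r"(?:tRNA[\s\-_]*)?([A-Za-z]{3})[\s\-_(]*([ACGTUacgtu]{3})\)?"
)


def normalize_trna(name):
    """Return the name as trn<one-letter code>-<anticodon>, or None."""
    name = name.strip()
    match = ONE_LETTER.fullmatch(name)
    if match:
        amino_acid = match.group(1).upper()
        if amino_acid not in VALID_ONE_LETTER:
            return None
    else:
        match = THREE_LETTER.fullmatch(name)
        if match is None:
            return None
        amino_acid = THREE_TO_ONE.get(match.group(1).capitalize())
        if amino_acid is None:
            return None
    # Anticodons are written as RNA, so DNA-style T becomes U.
    anticodon = match.group(2).upper().replace("T", "U")
    return f"trn{amino_acid}-{anticodon}"
```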
Similarly, all rRNAs have been renamed to use the format:
rrn<SIZE IN SVEDBERG UNITS, without the "S">, e.g. rrn16
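As a minimal sketch of this renaming, the helper below extracts the first number found in an rRNA annotation and attaches it to the rrn prefix. The function name and the example input spellings are assumptions made for illustration; they are not taken from the dataset.

```python
import re


def normalize_rrna(name):
    """Return rrn<size in Svedberg units>, or None if no size is found."""
    match = re.search(r"\d+(?:\.\d+)?", name)
    if match is None:
        return None
    return "rrn" + match.group(0)


# Hypothetical input spellings
for raw in ("16S ribosomal RNA", "rrn4.5S", "23S rRNA", "rrn5"):
    print(raw, "->", normalize_rrna(raw))
```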
Lastly, we took a look at non-gene annotations, focusing on inverted repeats (IR), which are known to be present
in most chloroplast genomes.
To identify all missing IRs, we self-aligned each genome using BLAST and set the cutoff to 500 bp.
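The idea behind this self-alignment can be illustrated without BLAST. The sketch below is a simplified stand-in, not the method we used: it finds exact matches between a genome and its own reverse complement by seeding with short k-mers and extending to the right, so it tolerates no mismatches or gaps. The function names, the seed length, and the assumption of uppercase A/C/G/T input are choices made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(genome, min_length=500, seed=12):
    """Return (start_a, start_b, length) for exact inverted repeats.

    genome[start_a:start_a + length] equals the reverse complement of
    genome[start_b:start_b + length]. Coordinates are 0-based.
    `seed` must not exceed `min_length`.
    """
    n = len(genome)
    rc = reverse_complement(genome)

    # Index every seed-length k-mer of the forward strand.
    index = {}
    for i in range(n - seed + 1):
        index.setdefault(genome[i:i + seed], []).append(i)

    hits = []
    covered = {}  # diagonal (i - j) -> end of the last match in rc coordinates
    for j in range(n - seed + 1):
        for i in index.get(rc[j:j + seed], ()):
            diagonal = i - j
            if covered.get(diagonal, -1) > j:
                continue  # this seed lies inside a match already extended
            length = seed
            while (i + length < n and j + length < n
                   and genome[i + length] == rc[j + length]):
                length += 1
            covered[diagonal] = j + length
            # Translate the reverse-complement position to forward coordinates.
            start_b = n - j - length
            # i < start_b reports each pair once and skips self-palindromes.
            if length >= min_length and i < start_b:
                hits.append((i, start_b, length))
    return hits
```

A real genome would still need an alignment tool such as BLAST, since the two IR copies may differ by a few bases and an exact-match search would then split or miss the repeat.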
Outlook
As the number of available chloroplast genomes continues to grow, we plan to adapt and expand our database well after the iGEM competition. Besides adding new data, we also aim to further improve its quality.
To this end, we want to perform a detailed investigation of hypothetical proteins and open reading frames (ORFs). Since the current naming convention for ORFs only encodes the number of amino acids, valuable information about their interrelationships and possible functions remains untapped.
To address this, we want to use both the database and our model to perform multiple sequence alignment and clustering in order to gain deeper insight into these genes.
Sources
[1] Fu, J., Liu, H., Hu, J., Liang, Y., Liang, J., Wuyun, T., & Tan, X. (2016). Five Complete Chloroplast Genome Sequences from Diospyros: Genome Organization and Comparative Analysis. In B. Heinze (Ed.), PLOS ONE (Vol. 11, Issue 7, p. e0159566). Public Library of Science (PLoS). https://doi.org/10.1371/journal.pone.0159566
[2] O’Brien, E. A., Zhang, Y., Wang, E., Marie, V., Badejoko, W., Lang, B. F., & Burger, G. (2009). GOBASE: an organelle genome database. In Nucleic Acids Research (Vol. 37, Issue Database, pp. D946–D950). Oxford University Press (OUP). https://doi.org/10.1093/nar/gkn819
[3] Cui, L. (2006). ChloroplastDB: the Chloroplast Genome Database. In Nucleic Acids Research (Vol. 34, Issue 90001, pp. D692–D696). Oxford University Press (OUP). https://doi.org/10.1093/nar/gkj055
[4] Heinze, B. (2007). A database of PCR primers for the chloroplast genomes of higher plants. In Plant Methods (Vol. 3, Issue 1, p. 4). Springer Science and Business Media LLC. https://doi.org/10.1186/1746-4811-3-4
[5] Zhang, R., Ge, F., Li, H., Chen, Y., Zhao, Y., Gao, Y., Liu, Z., & Yang, L. (2019). PCIR: a database of Plant Chloroplast Inverted Repeats. In Database (Vol. 2019). Oxford University Press (OUP). https://doi.org/10.1093/database/baz127
[6] Singh, B. P. (2020). CpGDB: A Comprehensive Database of Chloroplast Genomes. In Bioinformation (Vol. 16, Issue 2, pp. 171–175). Biomedical Informatics. https://doi.org/10.6026/97320630016171
[7] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. In Journal of Molecular Biology (Vol. 215, Issue 3, pp. 403–410). Elsevier BV. https://doi.org/10.1016/s0022-2836(05)80360-2
[8] Laslett, D. (2004). ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. In Nucleic Acids Research (Vol. 32, Issue 1, pp. 11–16). Oxford University Press (OUP). https://doi.org/10.1093/nar/gkh152