Team:Marburg/Model

Model

Abstract

The increasing amount of data will present a daunting challenge for future generations of scientists. In order to address these issues and accelerate the data analysis pipeline for our chloroplast database, we have investigated multiple state-of-the-art machine learning methods. We used these novel approaches to drastically reduce the time needed for classical bioinformatic analysis. This allowed us to compute the Steiner-String sequence of nearly 7000 rrn16 promoters. Using this approach, we gained insights into the regulatory sequence space of chloroplast promoters and were able to design and construct novel synthetic regulatory parts to be used in chloroplasts of numerous plants.

Introduction

We live in an increasingly data-driven world, where high-throughput sequencing and multiplex experiments are transforming biology into an information science. As a result, the main challenges in biological research have shifted from data generation and processing to interpretation and data analysis.


While synthetic biology is quickly embracing this new direction, it is evident that the effective analysis and interpretation of the ever-increasing amount of data will present a daunting challenge for future generations of scientists. One potential cause of this problem could be that, despite the development of increasingly sophisticated tools, the underlying core algorithms of bioinformatics have not been improved for decades. Consequently, to bring biology into the 21st century of big data science, an in-depth assessment of the bioinformatics foundation might be needed.

Underlying Problem

At the core of many of biology’s most used algorithms lies the computation of the so-called edit distance D(s1, s2). This distance can be thought of as a dissimilarity measure: the minimum number of substitution, insertion and deletion operations required to transform one sequence into another. While seemingly straightforward, its computation with traditional methods proves to be rather difficult. Classical dynamic-programming implementations run in time quadratic in the sequence length and are hard to parallelize, making the edit distance a major bottleneck for analyzing large-scale biological datasets. Problems that are relatively simple in other spaces therefore become combinatorially hard in the space of sequences defined by this edit distance.
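To make the quadratic cost concrete, here is a minimal Python sketch of the classical dynamic-programming computation of the edit distance. The function name `edit_distance` is ours and purely illustrative; it is not taken from any library or from our pipeline.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning s1 into s2."""
    # prev[j] is the distance between the already processed prefix of s1
    # and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning i characters into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,                # delete c1
                curr[j - 1] + 1,            # insert c2
                prev[j - 1] + (c1 != c2),   # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))  # 1 (delete the C)
```

The two nested loops visit every pair of positions, so comparing two sequences of length n takes on the order of n² steps, and a full pairwise comparison of thousands of sequences multiplies this cost again.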


This challenge also extends to important methods such as multiple sequence alignment and hierarchical clustering, algorithms whose exact calculation is computationally intractable even for a small number of sequences. In contrast to other methods that have seen widespread adoption in the data sciences, these algorithms are data-independent and do not make use of the structure of the underlying data.


Although this may sound like an advantage at first, they lack the possibility of dimensionality reduction, putting them at a severe disadvantage compared to their data-dependent counterparts. In this context, the assumption is that all real-world data has an intrinsic structure that can be exploited to significantly decrease its dimensionality compared to the original input space. Leveraging this structure to create a data-dependent heuristic therefore has the potential to effectively speed up the analysis of large biological datasets.

Solution

Evidently, the question arises as to why such methods haven’t seen wide adoption in bioinformatics yet. One of the possible causes is that, unlike most other problems in machine learning, problems related to string matching are often discretely formulated, making them difficult to employ in current deep-learning approaches. Furthermore, representation learning methods based on Euclidean space often fail to capture the complex evolutionary hierarchical structure of real-world biological data.


In order to address these issues and accelerate the data analysis pipeline for our chloroplast database, we have investigated multiple state-of-the-art machine learning methods. We consequently came across a novel method called “Neural Distance Embeddings for Biological Sequences” (NeuroSEED), a general framework that embeds sequences in a geometric vector space, making it possible to quickly approximate the edit distance of sequences.

Figure 1: Key Idea of NeuroSEED
(Corso et al., 2021) Learn an encoder function f that preserves distances between the sequence space and the vector space.

One of the key ideas of NeuroSEED is to learn an encoder function that preserves distances: the distance between two embedded points in the vector space should approximate the edit distance between the corresponding sequences. Once sequences are embedded into this continuous vector space, the embedding can be used to study the relationship between sequences and potentially to decode new ones.


Since Euclidean spaces are ill-suited for the complex nature of biological sequences, a hyperbolic embedding was chosen. Oftentimes referred to as a continuous version of a tree, this geometry is perfectly suited for the embedding of the highly hierarchical biological sequence data.
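As an illustration of what a hyperbolic distance looks like, the sketch below computes the distance in the Poincaré ball, one common model of hyperbolic space. That this exact model matches our trained embedding is an assumption made for the example, and the function name `poincare_distance` is illustrative.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit ball (Poincare model)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Two pairs of points separated by the same angle, one pair near the origin
# and one pair near the boundary of the ball.
near_origin = poincare_distance([0.1, 0.0], [0.0, 0.1])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
print(near_origin, near_boundary)
```

Distances grow rapidly towards the boundary of the ball, which leaves exponentially more room for the "leaves" of a hierarchy than a Euclidean space of the same dimension. This is the tree-like behavior referred to above.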

Implementation

To put theory into practice, we began implementing the methodology using our newly created chloroplast database. One of the goals of our modeling project was to investigate whether, based on the insights of the regulatory sequences contained in this database, we can build synthetic parts that have favorable properties in a broad range of different cell-free extracts.


For this, we specifically took the Integrated Human Practices feedback with Northwestern University to heart and focused on analyzing endogenous promoter sequences. One of the obvious candidates for this approach was the promoter of rrn16, our best new basic part. To investigate the evolutionary regulatory space between our working cell-free extracts in more detail, we decided to focus our search on genomes from the spermatophyte clade.

Figure 2: Chloroplast DB
Example query of getting all rrn16 genes from spermatophyta using our chloroplast database.

To make sure that the corresponding promoter sequence was indeed included in the data, we decided to export the 250 bp region upstream of the gene start. Thanks to our novel, highly curated chloroplast database, we were able to export all 6,993 sequences of interest with just a few lines of code, which greatly accelerated this part of the project.
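The upstream export can be sketched in plain Python as follows. This is not the query interface of our database: the function names, the 0-based half-open coordinate convention and the treatment of the genome as a linear string (chloroplast genomes are circular, so a real export would need to wrap around) are all assumptions made for the example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome: str, gene_start: int, gene_end: int,
                    strand: str, length: int = 250) -> str:
    """Return up to `length` bases upstream of a gene.

    Coordinates are 0-based and half-open, [gene_start, gene_end), on the
    forward strand. The genome is treated as linear (a simplification).
    """
    if strand == "+":
        return genome[max(0, gene_start - length):gene_start]
    if strand == "-":
        # For a reverse-strand gene, "upstream" lies after gene_end on the
        # forward strand and has to be read as the reverse complement.
        return reverse_complement(genome[gene_end:gene_end + length])
    raise ValueError("strand must be '+' or '-'")


toy_genome = "AAAACCCCGGGGTTTT"
print(upstream_region(toy_genome, 8, 12, "+", length=4))  # CCCC
```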

Steiner String

After having extracted all the relevant data, we set out to test our model on the creation of a so-called Steiner string sequence. The Steiner string is defined as the sequence s* that minimizes the consensus error, i.e. the sum of edit distances, on the set of sequences S = {s_1, ..., s_n}:

$$s^* = \arg\min_{s} \sum_{s_i \in S} D(s, s_i)$$
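Under the usual definition, the consensus error of a candidate is the sum of its distances to all sequences in the set. The sketch below evaluates it for a few candidates. To keep the example short it uses the Hamming distance as a stand-in for the edit distance, and it only searches among given candidates, whereas the true Steiner string is searched over all possible strings. All function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Stand-in distance for equal-length strings (NOT the edit distance)."""
    if len(a) != len(b):
        raise ValueError("hamming() needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def consensus_error(candidate, sequences, distance=hamming):
    """Sum of distances from the candidate to every sequence in the set."""
    return sum(distance(candidate, s) for s in sequences)


def best_candidate(candidates, sequences, distance=hamming):
    """Candidate with the smallest consensus error (restricted search)."""
    return min(candidates, key=lambda c: consensus_error(c, sequences, distance))


seqs = ["ACGT", "ACGA", "TCGT"]
print(consensus_error("ACGT", seqs))  # 0 + 1 + 1 = 2
print(best_candidate(seqs, seqs))     # ACGT
```

Any other distance function with the same signature, such as an exact edit distance, can be passed through the `distance` argument.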

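For this approach, it is necessary to train both an encoder f and a decoder g. The resulting autoencoder is trained on sequences that are first encoded into the latent vector space and afterward decoded. The loss function takes into account both an edit distance approximation component and a sequence reconstruction component. The corresponding loss function is therefore given by the formula

$$L(f, g) = (1 - \alpha)\, L_{ed}(f) + \alpha\, L_{r}(f, g)$$

where

$$L_{ed}(f) = \mathbb{E}_{s_1, s_2}\left[\left(D(s_1, s_2) - d(f(s_1), f(s_2))\right)^2\right]$$

is the edit distance loss, with d the (scaled) distance function of the embedding space, and

$$L_{r}(f, g) = \mathbb{E}_{s}\left[H(s, g(f(s)))\right]$$

is the reconstruction loss, weighted by the hyperparameter alpha, with the element-wise cross-entropy

$$H(s, \hat{s}) = -\sum_{i=1}^{|s|} \log \hat{s}_i(s_i)$$

where \hat{s}_i(s_i) is the probability the decoder assigns to the true base s_i at position i.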

In order to increase the robustness of the decoder, Gaussian noise was added to the embedded points in the latent space before decoding them. After reparametrization to account for this trick, the reconstruction loss function is updated to

$$L_{r}(f, g) = \mathbb{E}_{s}\, \mathbb{E}_{\epsilon}\left[H(s, g(f(s) + \epsilon))\right]$$

where H is the cross-entropy and epsilon is normally distributed with mean 0 and a variance that is treated as a hyperparameter.
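The pieces of this loss can be sketched in plain Python for a single training example. This is not our training code: the weighting (1 - alpha) for the edit distance term and alpha for the reconstruction term, the summed cross-entropy and all function names are assumptions made for illustration, and a real implementation would operate on batches of tensors in a deep-learning framework.

```python
import math
import random

ALPHABET = "ACGT"


def cross_entropy(target, probs):
    """Sum over positions of -log(probability assigned to the true base).

    `probs` holds one probability vector over ALPHABET per position.
    """
    return -sum(math.log(p[ALPHABET.index(base)]) for base, p in zip(target, probs))


def add_noise(embedding, std_noise=0.1, rng=random):
    """Add independent Gaussian noise (mean 0) to every embedding coordinate."""
    return [x + rng.gauss(0.0, std_noise) for x in embedding]


def combined_loss(true_dist, embedded_dist, target, probs, alpha=0.003):
    """Assumed weighting: (1 - alpha) * edit distance loss + alpha * reconstruction loss."""
    edit_loss = (true_dist - embedded_dist) ** 2
    reconstruction_loss = cross_entropy(target, probs)
    return (1 - alpha) * edit_loss + alpha * reconstruction_loss


# A decoder that is completely unsure predicts a uniform distribution per position.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(2)]
print(combined_loss(true_dist=3.0, embedded_dist=2.0, target="AC", probs=uniform))
```

In training, `add_noise` would be applied to the encoder output before it is passed to the decoder, so that the decoder learns to reconstruct sequences from slightly perturbed points.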


For the implementation, we decided to use convolutional neural networks (CNNs), since they outperformed both feedforward and recurrent models in the original paper.


Figure 3: Steiner string approach
(Corso et al., 2021) On the left, the training procedure using pairs of sequences. On the right, the extrapolation for the generation of the Steiner string by decoding the geometric median in the embedding space.

This allowed us to significantly reduce the complexity of the Steiner string search. In the sequence space with the edit distance, finding the median point is a hard combinatorial optimization problem. In the space of real vectors with the distance functions used in this project, however, it becomes a relatively simple procedure that can even be done explicitly in some cases. The Steiner string of a set of strings S can thus be easily approximated by decoding the geometric median of the embedded sequences:

$$s^* \approx g\left(\arg\min_{x} \sum_{s \in S} d(x, f(s))\right)$$
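To show why the median is easy to find in a vector space, the sketch below computes a geometric median with Weiszfeld's iterative algorithm. It uses the Euclidean distance for simplicity, whereas our model works in a hyperbolic space, so this is an illustration of the idea rather than the procedure we ran. The decoding step with the trained decoder g is not shown.

```python
import math


def geometric_median(points, iterations=100, eps=1e-9):
    """Approximate the point minimizing the sum of Euclidean distances to `points`."""
    dim = len(points[0])
    # Start from the centroid (coordinate-wise mean).
    median = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iterations):
        # Weiszfeld update: re-weight every point by the inverse of its
        # current distance to the estimate (clamped to avoid division by zero).
        weights = [1.0 / max(math.dist(p, median), eps) for p in points]
        total = sum(weights)
        median = [sum(w * p[k] for w, p in zip(weights, points)) / total
                  for k in range(dim)]
    return median


embedded = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
print(geometric_median(embedded))
```

For the three collinear points in the example the estimate converges to the middle point (1, 0) and not to the centroid, which shows that the median is much less sensitive to a single outlier than the mean.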


Testing

After deciding on the design of the autoencoder and on a hyperbolic embedding space, it was time to start the training procedure. For this, the roughly 7000 Prrn16 sequences were split into 4500 for training, 700 for validation, and 50 groups of 30 sequences for which the Steiner string was computed. The model was trained using the following parameters:


  • distance = hyperbolic
  • normalization = 0.25
  • embedding_size = 64
  • layers = 3
  • alpha = 0.003
  • std_noise = 0.1
  • lr = 1e-05
  • weight_decay = 0.0
  • dropout = 0.0
  • epochs = 500
  • batch_size = 128
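The split described above, together with the listed parameters, can be written down as a small Python sketch. Whether the sequences were shuffled randomly before splitting is an assumption, as are the function name `split_dataset` and the seed. The parameter values are the ones from the list.

```python
import random

# Training parameters as listed above.
HYPERPARAMS = {
    "distance": "hyperbolic", "normalization": 0.25, "embedding_size": 64,
    "layers": 3, "alpha": 0.003, "std_noise": 0.1, "lr": 1e-05,
    "weight_decay": 0.0, "dropout": 0.0, "epochs": 500, "batch_size": 128,
}


def split_dataset(sequences, n_train=4500, n_val=700,
                  n_groups=50, group_size=30, seed=0):
    """Shuffle and split into training, validation and Steiner string groups."""
    needed = n_train + n_val + n_groups * group_size
    if len(sequences) < needed:
        raise ValueError(f"need at least {needed} sequences, got {len(sequences)}")
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)  # assumed: random, reproducible split
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    rest = shuffled[n_train + n_val:needed]
    groups = [rest[i * group_size:(i + 1) * group_size] for i in range(n_groups)]
    return train, val, groups


# Stand-in for the 6,993 exported promoter sequences.
train, val, groups = split_dataset([f"seq{i}" for i in range(6993)])
print(len(train), len(val), len(groups), len(groups[0]))  # 4500 700 50 30
```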

Analysis

The resulting consensus string was surprisingly similar to our N. tabacum Prrn16 promoter, differing by only one base pair over its 64 bp length.


Figure 4: Sequence comparison
Comparing the Steiner string with the native tobacco Prrn16 promoter
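A position-by-position comparison of two equal-length sequences, as shown in the figure, can be sketched as follows. The sequences in the example are toy strings and not the actual promoter or Steiner string, and the function name is illustrative.

```python
def mismatches(a: str, b: str):
    """List (position, base_in_a, base_in_b) for every differing position."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(mismatches("ACGTACGT", "ACGTACCT"))  # [(6, 'G', 'C')]
```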

Because the underlying data might be skewed towards species that are closely related to N. tabacum, we decided to take a closer look at the sequence space and used NeuroSEED for multiple sequence alignments.


The resulting data revealed that the dataset can be roughly split into two clusters, one strongly resembling Prrn16 from N. tabacum and the other closely resembling the promoters of monocots (e.g., wheat, rice, maize). Since the two clusters were surprisingly different, we set out to test both the native wheat Prrn16 and the N. tabacum Prrn16 promoter in spinach.


Since the native spinach rrn16 promoter is very similar to that of tobacco, we assumed that the N. tabacum promoter would also show the stronger expression in spinach. To our astonishment, however, the opposite was true. Intrigued by this discovery, we took another look at our multiple sequence alignment results to study the conservation of nucleotides. In doing so, we discovered that the Prrn16 sequences of these two clusters diverged significantly in 5 regions of interest.
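One simple way to study conservation in an alignment is to look at each column separately. The sketch below computes, per column, the fraction of sequences that share the most common symbol, and lists the columns in which two clusters have different majority symbols. This is an illustrative analysis on toy data and not necessarily the exact procedure we used. All function names are ours.

```python
from collections import Counter


def majority_symbols(alignment):
    """Most common symbol in every column of an alignment (equal-length rows)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*alignment)]


def column_conservation(alignment):
    """Fraction of rows sharing the most common symbol, per column."""
    return [Counter(column).most_common(1)[0][1] / len(column)
            for column in zip(*alignment)]


def divergent_columns(cluster_a, cluster_b):
    """Column indices where the two clusters have different majority symbols."""
    pairs = zip(majority_symbols(cluster_a), majority_symbols(cluster_b))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


cluster_a = ["ACGT", "ACGT", "ACGA"]  # toy data
cluster_b = ["ACTT", "ACTT", "GCTT"]
print(divergent_columns(cluster_a, cluster_b))  # [2]
```

Runs of adjacent divergent columns can then be merged into candidate regions of interest.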


To investigate the effects of these regions on promoter expression, we designed 5 synthetic promoters based on the N. tabacum one, changing 1 region of interest at a time. We managed to clone all 5 sequences as LVL-0 Golden Gate parts, but due to the time-limited nature of iGEM, we weren’t able to test them before the wiki freeze.
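The design principle of changing one region at a time can be sketched as follows. The example assumes that the base and donor sequences are already aligned and of equal length, and it uses toy sequences and made-up region coordinates. The actual promoter sequences and region boundaries are not reproduced here.

```python
def single_region_variants(base: str, donor: str, regions):
    """Build one variant per region by copying donor[start:end] into the base.

    `regions` is a list of 0-based, half-open (start, end) intervals.
    """
    if len(base) != len(donor):
        raise ValueError("base and donor must be aligned to the same length")
    return [base[:start] + donor[start:end] + base[end:] for start, end in regions]


toy_base = "AAAAAAAA"    # stands in for the N. tabacum promoter
toy_donor = "CCCCCCCC"   # stands in for a promoter from the other cluster
print(single_region_variants(toy_base, toy_donor, [(0, 2), (4, 6)]))
# ['CCAAAAAA', 'AAAACCAA']
```

Because every variant differs from the base promoter in exactly one region, a change in expression can be attributed to that region alone.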

Outlook

We believe that this model, in combination with our database, is a promising tool to gain insight into the vast amount of regulatory sequences. After the wiki freeze, we hope to continue our work on the Prrn16 promoter and gain deeper insights into which elements are responsible for the discrepancy in expression. We believe that future applications may include both the creation of libraries for different regulatory elements and the creation of completely novel synthetic parts that are based on the insights gained via this model.