# Team:UPF Barcelona/Software AMine

Team:UPF Barcelona - 2021.igem.org

A modular, simple and flexible Python-based prokaryotic genomics software that does not depend on traditional sequence alignment but an efficient combination of word frequency-based methods and low dimensional clustering.

To be, or not to be resistant? That is not the only question here, but a fundamental one.

It is clear that antibiotic resistance is eminently complex by its nature, and that as knowledge about it increases, the number of possibilities and the multiple mechanisms involved becomes greater, especially regarding indirect interactions that may create the right conditions for resistance to emerge.

In this context, it could be valuable to analyze the biological system from a bottom-to-top perspective, focusing on how the connections between simple elements bring out the dangerous capacities of resistant bacteria. Thus, it may be worth using approaches that seek to find from scratch those threads to pull at first: what may be seating the bases for resistance?

That was the motivation to create AlphaMine. This is a modular, simple and flexible Python-based genomics software. To perform its task, AlphaMine does not depend on traditional sequence alignment but a much more efficient combination of word frequency-based methods and low dimensional clustering. The software can be easily integrated into computational pipelines, and it is intuitive to use and configure. It focuses on the autonomous analysis of large collections of prokaryotic genomes from a comparative perspective and relies on set theory operations.

AlphaMine is designed as a nested-class structure. Its core module, called Pangee, is the one directly manipulating the genomes, performing set operations, and constructing pangenomes. Then, the system is built on Pangee with the other parts, such as a preprocessor for data preparation, a command-line interface for better user control, or a managing mechanism to bridge between the higher-level instructions and the internal processes.

In the following sections, the entire development and engineering process are explained, from the conceptual framework and mathematical principles to the algorithmic design and software implementation.

### Posing the problem

Let us imagine that we want to identify the genetic elements related to a certain function in a family of species.

We have a subgroup of those species, $$R$$, in which we observe the functionality per se, while another subgroup $$S$$ does not show it. Subsequently, one way to find potential candidates would be to start isolating those genetic components that are only found in the genomes of subgroup $$R$$ and never appear in $$S$$ or vice versa. Why? Well, those that only appear in $$R$$ could be directly exerting the function of interest, while those that only appear in $$S$$ could be inhibiting the first: in one way or the other, the possibility of studying them could be enlightening.

Replace $$R$$ and $$S$$ with "Resistant" and "Susceptible" to a specific antibiotic, and the result of this approach will be a pool of genetic elements that may be supporting antibiotic resistance, by their action or absence. Consequently, this seems logical first step to understand, from bottom to top, this dangerous threat emerges. But, what is this "group of genetic elements characteristic of some type of organisms"? A collective super-genome? Be that as it may, something like this is necessary to make the approach possible.

### On the pangenome

Fortunately, genomics brings us a powerful concept that perfectly fits this key definition. The pangenome, or supragenome, describes the entire set of genes shared by a clade, a group of species, and it is a useful tool to unveil their insides beyond the traditional notion of "discrete", well-separated organisms. A complete pangenome can be divided into three different components: the core pangenome (genes shared by the entire clade), the shell pangenome (genes shared by two or more species in the clade), and the cloud pangenome (genes unique to one of the clade's species).

If you know the pangenome, you gain a deeper understanding of the studied species' true nature. In that way, to materialize the methodology mentioned before, it is necessary to build a system capable of constructing the pangenomes for different collections of bacterial strains and to properly operate with them, so the elements of interest can be finally isolated.

### Introduction to sets

To build a computational system that deals with the challenge presented above, it is necessary to re-express the task in formal terms. For this purpose, we can use set theory, a powerful branch of mathematics that defines a very useful abstract object: the so-called set. $S$

A set can be understood as a collection of other mathematical elements, the singletons, which can be geometric shapes, symbols, values, vectors, among others, even other sets. $S = \{a, b, c, ... \}$

Furthermore, the set theory also gives us the ability to manipulate sets and study the relations between them. This is possible thanks to some basic operations and properties. Here, some examples are shown.

Equality. Two sets are considered identical only and only if they contain the exact same singletons. $A=B$

Union. Produces an output set containing the entire collection of singletons present in the group of input sets. $A \cup B$

Intersection. Generates an output set with those singletons shared by each of the sets in a group of input sets. $A \cap B$

Complement. The output set is the one integrating all the singletons that do not belong to the input set. $A^{c}$

Subtraction. The output is a version of the minuend set without the singletons present in the subtrahend. $A-B$

### Formalizing sequences, genomes, and pangenomes

How can we express a gene or a DNA fragment as a mathematical entity? One way is to understand it as a symbolic sequence $$s_i$$: an ordered list of $$L$$ elements expressing one of the 4 basic units in the genetic code. $s_i = \left(b_1, b_2, b_j, ..., b_L \right) \text{\quad where \quad} b_j \in \{A, T, G, C \}$

Using set theory, we can now formalize a genome as a set made of a specific type of singletons: genes, or more generally, the DNA sequences previously defined, $$s_i$$. $G = \{s_1, s_2, s_i, ..., s_n \}$

Once reached this point, starting from a collection of genomes we can easily drive the complete pangenome with the union operation. $P = G_1 \cup G_2 \cup G_i ... = \{p_{1}, p_{2}, p_{j}, ... , p_{N} \} \equiv G_{p}$

Likewise, we can obtain the core pangenome by applying the intersection operation to the genome collection. $P_{core} = G_1 \cap G_2 \cap G_i ... = \{p_{core_1}, p_{core_2}, p_{core_j}, ... , p_{core_M} \} \equiv G_{p_{core}}$

Consequently, their relation can be expressed too. $P_{core} \subset P \qquad M < N$

In any case, the pangenomes are sequence sets too, just like genomes. However, for all this to be possible, there is a fundamental issue that should be solved. When are singletons equal?

### The equality problem

As it has been reviewed before, to compare sets it is necessary to distinguish if their singletons are the same or not. If that is not possible, the operations and properties explained will not apply.

When dealing with symbols, this is straightforward: axiomatically, symbols are simply the same or not. In our case, as we are working with sequences, we have to extend this principle to evaluate if the sequences do present the same elements in the same order. If both sequences have the same length, the solution is to check if, for each position, the elements found in both sequences are equal. $s_i = \left(b_{i_1}, b_{i_2}, b_{i_k}, ..., b_{i_L} \right) \qquad s_j = \left(b_{j_1}, b_{j_2}, b_{j_k}, ..., b_{j_L} \right)$ $s_i = s_j \text{\quad if \quad} b_{i_k} = b_{j_k} \text{\quad for \quad} k \in \{1, 2, ..., L \} ?$

But, is really that simple?

Well, with DNA, the pictures become a little bit more complex. In nature, genes are prone to suffer lots of variations and changes, although they still play a similar biological role. Thus, two genes can be categorized as equivalent de facto from a functional perspective without being exactly equal in sequence. In this context, for instance, two "equivalent" genes could have different sizes because of their evolutive divergence: one may have accumulated extra nucleotides, for instance. If the sequences to compare have different lengths, it is true that we still could make the measurement relational to the shorter one. $s_i = \left(b_{i_1}, b_{i_2}, b_{i_k}, ..., b_{i_{L_i}} \right) \qquad s_j = \left(b_{j_1}, b_{j_2}, b_{j_k}, ..., b_{i_{L_j}} \right) \qquad L_i > L_j$ $s_i = s_j \text{\quad if \quad} b_{i_k} = b_{j_k} \text{\quad for \quad} k \in \{1, 2, ..., L_j \} ?$

Nevertheless, there is no way to ensure that the alterations are concentrated in one sort of cleary-segregated, extra region. Most likely, these differences will expand across the entire sequence, as deletions, insertions, and single-nucleotide polymorphisms are common in biological systems. In that way, assuming that two genes are completely different just because there are mismatches in their sequences is simply incorrect.

Fortunately, these affairs have been a headache for scientists during the development of biology as we know it. Thanks to their effort, powerful concepts have emerged from an evolutionary perspective. These ideas help us understand the identities of genes and the relationship between them from a more realistic point of view, both at the level of origin and function.

Sequence homology is one of these concepts. When two genes are homologous, it implies or nucleic acids are similar to each other because they have the same evolutionary origin. Homology is expressed as a magnitude whose value depends on the strength of the hypothetical relationship studied. Likewise, when analyzing different species inside a clade, we can also find orthologous genes. These are homologous sequences that originate in a speciation event: a separation of a common ancestor into two distinct species, with two different copies of the original sequence that diverge in the future path. Accordingly, orthology is specially imporant for our problem.

In this context, a strong homology between sequences can be correlated with a common role, and strong orthology may imply that two genes act as functional counterparts in two related species. Subsequently, we can define the concept of a group of sequences that, despite being different and found on different organisms, are likely to be highly related, coming from a common ancestor that provided them with an alike biological role .

This has a name, it is called a gene family. As in our problem we are interested in what the DNA sequences are doing inside their biological systems, we can use this definition to solve our issue: two sequences will be shared de facto by two genomes if they are homologous enough to be considered in the same gene family.

But how is homology measured? Mostly, looking at how similar the sequences are. In that way, sequence similarity is a result that comes from observation, while homology is a specific interpretation of such measurement (high similarity between two sequences is explained by common origin). This does not have to be true always, of course, as short sequences may be very similar just by chance, but with the lengths considered in our approach, and if the similarity is high enough, it is very likely to be significant.

Therefore, the problem of determining equality turns into a task of computing and evaluating similarity. When comparing two sequences in distinct genomes, we will assume a common identity to exist if the result of some "similarity function" goes beyond a certain pre-defined threshold. $s_i = s_j \qquad \equiv \qquad similarity(s_i, s_j) \geq \theta_{similarity}$

By imposing this rule, we can finally do what we needed: applying intersections, unions, and derived operations to sequence sets. $G_{A} \cup G_{B}, \quad G_{A} \cap G_{B}, \quad G_{A} - G_{B}...$

Now that the fundamental pieces are ready, it is time to play.

### Introduction

As explained in the rationale, the idea of all this approach is to analyze a collection of bacterial strains: by comparing the ones that resist the antibiotic of interest and the ones that do not, the goal is to isolate what genetically differentiates them for further analysis. In the following subsections, a series of generalized algorithms associated to this task is defined. They are focused on implementing the basic set operations studied before, on the construction of all the types of pangenomes described and finally on how to apply all the concepts to solve the task.

### Construct any type of pangenome

The process is practically identical when computing the complete pangenome and the three components mentioned above: core, shell, and cloud. There are some minor details changing, and different strategies to minimize the workload, but this makes possible to design the approaches needed to compute the four types of pangenome.

Start with a group of $$N$$ genomes, the $$library$$. $library = G_{1}, G_{2}, G_{i},..., G_{N}$

Define one of them as reference, and eliminate it from the $$library$$. $G_{ref} = G_{i} \qquad library = G_{1}, G_{2}, ..., G_{N-1}$

Depending on the desired type of pangenome, the specific choice and subsequent steps diverge.

## Complete pangenome

Select as reference the genome with the highest number of sequences. $G_{ref} = max\left(G\right)$

Apply the following iterative structure.

$\text{For } G_i \text{ in } library \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad$

$\text{For } s_{i_k} \text{ in } G_i \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad$

$\text{For } s_{{ref}_j} \text{ in } G_{ref} \quad\qquad\qquad\qquad\qquad\qquad\quad$

$\text{If } similarity(s_{{ref}_j}, s_{i_k}) < \theta \quad$

$\quad \qquad \text{Add } negative \qquad$

$\text{If } negatives \text{ equal to } G_r \text{'s lentgh}$

$\qquad \qquad \qquad \qquad \qquad \qquad \text{Include } s_{i_k} \text{ in } G_{selected} \text{ and break this loop}$

For each sequence of $$G_{i}$$, it is checked whether there is an ortholog in $$G_{ref}$$. When there is not, a negative is counted. If after finishing the loop there are as many negatives as elements in the reference set, it means that the sequence in question is new, so it is added to $$G_{ref}$$. On the contrary, if there are fewer negatives, this implies some orthology is detected, so the sequence is already present. In that way, with more and more iterations, the size of the reference set becomes larger. Thus, at the last iteration, the resulting set will be a pangenome that contains all those sequences that appeared in each of the input genomes. $G_{ref} = P_{complete}$

## Core pangenome

Select as reference the genome with the smallest number of sequences. $G_{ref} = min\left(G\right)$

Apply the following iterative structure.

$\text{For } G_i \text{ in } library \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad$

$\text{Create an empty genome } G_{selected} \qquad\qquad\qquad\qquad\;\;$

$\text{For } s_{{ref}_j} \text{ in } G_{ref} \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad$

$\text{For } s_{i_k} \text{ in } G_i \qquad\qquad\quad\qquad\qquad\qquad\qquad$

$\text{If } similarity(s_{{ref}_j}, s_{i_k}) > \theta$

$\qquad \qquad \qquad \qquad \qquad \qquad \text{Include } s_{i_k} \text{ in } G_{selected} \text{ and break this loop}$

$\text{Update } G_{ref} \text{ to be } G_{selected} \qquad\qquad\qquad\qquad\qquad\qquad$

Each time that a sequence of $$G_{ref}$$ appears to have an ortholog in $$G_{i}$$, it is included in $$G_{selected}$$ and the search for that genome stops. If that is not the case, the sequence is excluded. In this case, when updating $$G_{ref}$$, the size of the reference set becomes smaller. Thus, at the last iteration, the resulting set will be a pangenome that contains only those genes presented orthology across the entire collection of input genomes. $G_{ref} = P_{core}$

## Shell pangenome

Select as reference the genome with the highest number of sequences. $G_{ref} = max\left(G\right)$

Apply the following iterative structure.

## Cloud pangenome

Select as reference the genome with the highest number of sequences. $G_{ref} = max\left(G\right)$

For each of the genomes in the $$library$$, apply the following iterative structure.

### Find the resistome and the vulnerome

Start with two groups of genomes: resistant and susceptible to the antibiotic. $G_{R_1}, G_{R_2}, G_{R_i},..., G_{R_N} \text{\quad and \quad} G_{S_1}, G_{S_2}, G_{S_i},...,G_{S_M}$

To avoid excluding any potential candidate, first construct the complete pangenome for each class, which means to compute the union of all the genomes belonging to the group in question. $P_R = G_{R_1} \cup G_{R_2} \cup G_{R_i} \cup ... \cup G_{R_N} \text{\quad and \quad} P_S = G_{S_1} \cup G_{S_2} \cup G_{S_i} \cup ... \cup G_{S_M}$

Once both pangenomes are obtained, compute the intersection between them. $P_R \cap P_S$

Then, on the one hand, subtract the intersection to the resistant pangenome $$P_R$$.
The result will be the resistome $$R$$. $R = P_R - P_R \cap P_S$

This is the set of all sequences that only appear on resistant strains. $R = \{r_1, r_2, r_i, ..., r_n \}$

On the other hand, subtract the intersection to the susceptible pangenome $$P_S$$.
The result will be the vulnerome $$V$$. $V = P_S - P_R \cap P_S$

This is the set of all sequences that only appear on susceptible strains. $V = \{v_1, v_2, v_i, ..., v_m \}$

Having obtained $$R$$ and $$V$$, the task is solved.

### State-of-the-art in sequence comparison

So far, we have defined general algorithms that allow the construction of the different types of pangenomes, and operate with them to obtain the information of interest from the input genome collections. However, all these approaches are based on an element yet to be defined, without which it is impossible to perform the calculations: the similarity function.

As previously discussed, the highly variable nature of DNA sequences makes analyzing their relationships not trivial. Traditionally, the similarity between DNA sequences has been estimated using alignment techniques. These methods, based on dynamic programming, have been very useful, but they are computationally very heavy, produce large errors in certain situations, and can become impractical at certain scales, simply because of how they are made.

### Alignment-free methods

In the context of this project, thousands of comparisons between DNA sequences have to be made for each genome, so saving as much time as possible in each operation is of vital importance to make the process feasible. In fact, this is becoming a general problem in genomics as sequencing technologies become brutally powerful, and the amount of data is increasingly overwhelming.

It is precisely this situation that is driving the development of a large group of alternative sequence comparison methods, which dispense with traditional methodologies. To achieve this, these techniques focus on exploiting an indirect measure of information, in its purest sense, that exists in DNA chains. They are the so-called free alignment methods. The latest studies show that at the level of efficiency they are much higher than their traditional counterpart. In addition, it is also being seen how they are capable of surpassing them also in accuracy, being able to even bypass alignment's weak points.

Although there are many alignment-free methods, with their unique properties, a basic division can be established between two distinct groups: word-based and information theory-based methods. The first one is based on studying the frequencies for which subsequences of certain lengths do appear, meanwhile the latest focus un estimating what intrinsic information do the sequence share.

### Defining a word frequency-based comparison algorithm

In the context mentioned above, below we propose an algorithm to compare sequences of the word frequency-based type. This will focus on reducing the sequences to be compared to two vectors of the same size, and calculating the angle to separate them, from which the similarity per se will be estimated.

We start from two sequences that we must compare.

First, we must define a fixed size k for the words that we are going to create, the kmers.

In the literature, there are fundamental mathematical relationships that allow us to estimate the optimal k from the size of the sequences that we are going to analyze, so we can leave this parameter defined.

Having already done this, we generate for each of the sequences its own set of kmers. This consists of an iterative process of going through the positions of the sequence and taking the following k-1 elements to form all possible kmers.

Having the derived set of each sequence, we calculate the union between both, and thus we obtain a new set.

The total number of kmers that could exist in both sequences is contained in this new set. Taking each of its elements, and counting how many instances of these exist in both sequences, we will have created two vectors that represent the frequencies of the kmers. As the order has been defined by the union vector, which is unique, both vectors will be consistent.

At this point, we simply have to calculate the cosine distance between both vectors, and multiply the number obtained by 100. The result will be the estimate of the percentage of similarity between both sequences.

### Implementing the kengine function

Having already defined a way to do the calculation, the next step is to write a small function that executes the process as efficiently as possible. In addition, it is important that it be easily integrated into other mechanisms to facilitate comparisons. Therefore, we have sought to incorporate the least number of external dependencies.

To build a computational system that deals with the challenge presented above, it is necessary to re-express the task in formal terms. To achieve this, we can use set theory, a powerful branch of mathematics. This offers us an object that can be of great help: the so-called set.

Let us imagine that we want to identify the genetic elements related to a certain function in a family of species. We have a subgroup of those species, A, in which we observe the functionality per se, while another subgroup $$S$$ does not show it. Subsequently, one way to find them would be to isolate those genetic components that are only found in the genomes of subgroup $$R$$ and never appear in $$S$$ or vice versa. Why? Well, those that only appear in $$R$$ could be directly exerting the function of interest, while those that only appear in $$S$$ could be inhibiting the first: in one way or the other, the possibility of studying them in greater depth could be enlightening.