Identifying the class to which every sequence belongs is a key part of our pipeline. To achieve this, we have created a series of models based on Convolutional Neural Networks (CNNs) that classify gene sequences according to their class and mechanism of action.
Introduction
Given that resistance is a problem with several layers of complexity, involving evolutionary pressure and individual environmental situations, our attempts to isolate its mechanisms must also follow several lines of action. We initially attempted to predict resistance to antibiotics directly from the gene sequence; however, the results obtained were not as accurate as expected. To improve the initial pipeline, we decided to classify sequences according to their mechanisms of action. In addition, promiscuity and virulence classifier models were created to broaden the reach of our software. Finally, we adapted the original model to classify between the 3 mentioned classes (Resistance, Virulence, and Promiscuity) to improve overall accuracy. The general pipeline can be seen in Figure 1.
In the following sections, you will find the theory underlying our methodologies and an explanation for each of the subsystems we have been developing.
Background
Neural networks are a relatively recent technology based on mimicking the biological functioning of the brain and nervous system; specifically, they simulate its electrical activity. A neural network consists of a set of nodes that emulate neurons, interconnected with each other by connections called weights, each with a modifiable value associated with it. The nodes are typically arranged in layers or vectors, with the output of one layer serving as the input to the next. All these input/output relationships play the role of the brain's synaptic connections.
To understand how neural networks work, it is important to know that each node has a modifiable value called a bias and is connected to the nodes of the previous and following layers, with a weight associated with each connection. These two parameters are modified over the iterations of the network: the more important or useful the feature being transmitted, the higher the associated weight and bias become, while connections that provide little to no information end up with much lower or even negative values. This update of the weights and biases is performed by gradient descent, which relies on the calculation of a loss function. Both mechanisms are explained in more detail below.
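As a toy illustration (plain Python, not part of our pipeline), the output of a single node can be computed as the weighted sum of its inputs plus its bias, passed through an activation function:

```python
import math

def node_output(inputs, weights, bias):
    """Output of one node: activation(weighted sum of inputs + bias)."""
    # Weighted sum of the incoming connections plus the node's bias
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Sigmoid activation squashes the result into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Example with three illustrative inputs, three connection weights and one bias
print(node_output([0.2, 0.7, 0.1], [1.5, -0.3, 0.8], bias=0.05))
```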
Types of learning
Training algorithms are mainly classified into three groups:
Supervised learning
These algorithms require the data to be labeled; essentially, a label is a small description of what each data point is. For images, for example, the label is a number corresponding to the content of the image, while for DNA sequences it might be the sequence's function. Our neural networks fall into this category; therefore, we need labeled data.
Unsupervised learning
In unsupervised algorithms there is no labelling. Instead, the algorithms find inner patterns and groupings in the data by looking at its different features. This can be useful for discovering new ways to organize the data.
Reinforcement learning
The algorithms in this group are based on trial-and-error behavior, constantly trading off the knowledge the system already has against the knowledge it does not yet possess. The system learns through a reward given when it finds the correct solution.
Gradient Descent
As explained before, gradient descent is the mechanism employed to update the weights of our neural network; it is what is called the optimization algorithm. The procedure this algorithm follows is divided into the following steps:
First, a set of inputs, which we call a batch, is given to the network. This batch is processed and the network generates what we call predictions, i.e., it predicts the label associated with each sample. Then, the loss function is evaluated. This function is chosen before the process starts, according to the type of problem addressed, and its purpose is to calculate the difference between the predicted labels and the actual labels associated with each sample. The loss function can also be viewed as a landscape whose minima are our objectives.
At this point, the derivative of the loss function is calculated. Graphically, it represents the slope of the tangent to the function; in other words, it is a vector that gives us the direction in which the function increases fastest. Since the purpose is to minimize the function, we move in the opposite direction. With this information, backpropagation is able to perform the weight update.
After the whole process (including backpropagation) is finished, the parameters are saved and the next batch of data serves as input, restarting the whole process with the weight values the previous batch finished with. It is, then, an iterative process that continues until all the data has been used. Even then, that colossal calculation corresponds to only one of the dozens of epochs of the network, which is why these networks need a lot of computational resources and time.
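As an illustrative sketch of these steps (NumPy, with made-up data and a simple mean-squared-error loss instead of the losses actually used in our models), a few gradient-descent iterations on a single batch look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))          # one batch of 32 samples with 10 features
y = rng.normal(size=32)                # their (made-up) target labels
w = np.zeros(10)                       # weights to be learned
learning_rate = 0.01

for epoch in range(5):                 # a few passes over the same batch
    predictions = X @ w                # 1. forward pass: predictions for the batch
    errors = predictions - y
    loss = np.mean(errors ** 2)        # 2. evaluate the loss function (MSE here)
    gradient = 2 * X.T @ errors / len(y)   # 3. derivative of the loss w.r.t. the weights
    w -= learning_rate * gradient      # 4. move against the gradient (the update step)
    print(f"epoch {epoch}: loss = {loss:.4f}")
```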
Backpropagation
The backpropagation algorithm is one of the most important building blocks of a neural network. It plays its role inside gradient descent, which without backpropagation would not be able to determine how changes in the first layers' weights affect the behavior of the whole network. Backpropagation proposes a paradigm in which the cost function is minimized by adjusting both the weights of the connections between nodes and the neurons' biases.
The amount of variation of the parameters depends only on the gradient of the loss function with respect to those parameters, understanding the gradient as the vector of partial derivatives with respect to all the parameters to be accounted for.
But why compute the derivatives? Because they give us the direction in which the function is changing and the rate of that change with respect to the parameters. Since the network can see how the function changes, it can calculate how much the parameters need to change to bring the loss function as close as possible to its minimum value, thereby minimizing the error. The change in parameters does not happen all at once; it is performed in steps of a determined size, which we call the learning rate (see Parameters section).
The gradients are calculated first in the last layer of the network, since it has no connections with further layers. Once the last layer is finished, the algorithm uses the chain rule of calculus to compute the gradients of the previous layer. Simply put, the rule allows us to obtain the gradients of the previous layer from the gradients of the current layer. This process is performed iteratively until it reaches the input layer.
By the end of the process, all weights have been updated for optimization, with useful connections reinforced and uninformative ones penalized.
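To make the chain rule concrete, the following toy sketch (NumPy, a two-layer network with sigmoid activations and a squared-error loss, not our actual architecture) propagates the gradient from the output layer back to the first layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=4)                 # one toy input sample
target = np.array([1.0])               # its (made-up) label
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 3))

# Forward pass
h = sigmoid(W1 @ x)                    # hidden layer activations
y = sigmoid(W2 @ h)                    # network output
loss = 0.5 * np.sum((y - target) ** 2)

# Backward pass: start from the last layer...
delta2 = (y - target) * y * (1 - y)    # gradient at the output layer's pre-activation
grad_W2 = np.outer(delta2, h)          # gradient for the last layer's weights

# ...and use the chain rule to reach the previous layer
delta1 = (W2.T @ delta2) * h * (1 - h) # gradient flowing back through W2
grad_W1 = np.outer(delta1, x)          # gradient for the first layer's weights
```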
Despite its apparent success, a wrong combination of parameters can prevent the algorithm from reaching the global minimum. One of its main drawbacks is that it can get stuck in a local minimum or a suboptimal solution. This is not so bad when the local minimum is located close to the global one, but there is no guarantee that this is always the case.
Parameters
Batch Size
The size of the sub-groups that are created from the total data to begin the Gradient Descent process.
Epochs
A single epoch is completed when the whole dataset has been passed forward and backward through the neural network exactly once. The number of epochs indicates how many times this process is repeated.
Learning Rate
The learning rate sets the step size of the parameter changes in Gradient Descent, specifically in Backpropagation. This numerical value (usually very small) marks the maximum change of a parameter's value at each gradient step. Large learning rates tend to move across the function too quickly and miss the global minima, while small learning rates lead to extremely long computation times.
As the number of epochs grows, the weights of the neural network are modified more times, so that the model can mature from underfitting towards an optimal fit (and, if trained for too long, towards overfitting). Deciding which parameter values to use is a key aspect when constructing a neural network.
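The toy loop below (plain Python, illustrative values only, not our real model) shows how the three parameters interact: each epoch visits every batch once, and every batch produces one parameter update whose size is scaled by the learning rate.

```python
data = list(range(100))          # 100 toy samples
batch_size = 16                  # samples per Gradient Descent step
epochs = 3                       # full passes over the dataset
learning_rate = 0.01             # step size of each update
parameter, updates = 0.0, 0

for epoch in range(epochs):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        gradient = sum(batch) / len(batch)      # stand-in for a real gradient
        parameter -= learning_rate * gradient   # one Backpropagation-style update
        updates += 1

print(f"{updates} updates performed over {epochs} epochs")  # 7 batches x 3 epochs = 21
```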
First layer of action
The Main Discriminator was created as a filter for the subsequent models. Instead of classifying according to mechanisms of action, it separates all the input sequences into the 3 classes that we considered core for solving our problem, with the addition of a False class, represented by essential genes, to account for sequences that did not fit the first three and to avoid misclassifications. Nevertheless, adding the Essential class came at a price, lowering the overall accuracy of the system and the Virulence metrics in particular (see Results). Due to this trade-off, we developed 2 versions of the Main Discriminator, one with the False class (4 classes in total) and the other with the 3 main classes only. Overall, both models, and especially the 3-class one, proved to be a very positive asset to the general pipeline, as they helped us discriminate information from the start and avoid misclassifications later.
Input Data
All the data used for the model was downloaded from several open-access databases such as PATRIC, CARD, ICEberg, DEG and VFDB [1][2][3][4][5]. The raw data presented several problems. On the one hand, since we used different sources, the formats were slightly different: while all of the files were in FASTA (text-based) format, some of them stored the sequences as amino acids while others used nucleotides directly, forcing us to convert them. Since the original tests were developed using codons, we decided to keep working with them.
On the other hand, the data often contained ambiguous labels. To solve this problem, we investigated other sources such as GenBank or NCBI [6][7] in order to find the function of the sequences whose labels we couldn't identify. This way, each sequence was described by a label indicating its class.
Once the data was ready, it was re-organized into a CSV (comma-separated values) file, with each row containing the class of the sequence followed by the sequence itself. This format allowed us to read the data efficiently. The original input file had 11556 labeled sequences.
Structure
Before being able to input the sequences into the CNNs, we need to transform them into numbers, given that CNNs are based on complex mathematical operations. We achieve this by tokenizing the data: we use a dictionary where a number is assigned to each codon. Since there are 64 possible codons, our sequences are transformed into sequences of numbers ranging from 1 to 64. A graphical representation of this process can be seen in Figure 6.
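A minimal sketch of this tokenization step (plain Python; the concrete codon-to-number assignment shown here is an illustrative assumption, only the 1-to-64 range is fixed):

```python
from itertools import product

# Build a dictionary assigning a number from 1 to 64 to each possible codon
# (the alphabetical ordering here is only illustrative).
codon_to_int = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3), start=1)}

def tokenize(sequence):
    """Split a nucleotide sequence into codons and map each codon to its number."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    return [codon_to_int[c] for c in codons]

print(tokenize("ATGGCTAAA"))   # e.g. [15, 40, 1] with this illustrative ordering
```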
Another important criterion to be met is that all sequences must be of the same length for the CNN to work properly. Therefore, we apply padding to the inputs: zeros are added at the end of each sequence until a fixed size is reached, that size corresponding to the length of the longest sequence. In the case of the main classifier, that number was 7680. In addition, labels must also be transformed from the class name ('Promiscuity' or 'Resistance', for instance) into numbers. This transformation is slightly different: instead of assigning a single number, we transform the label into an array of 3 or 4 positions (depending on the version used). Thus, each sequence has a label consisting of an array containing zeros at all positions except the one corresponding to the sequence's class, which contains a 1. This process is known as label binarization. The order of the positions was fixed to Essential, Promiscuity, Resistance, Virulence.
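A hedged sketch of the padding and label-binarization steps, assuming a Keras/scikit-learn implementation; the 7680 maximum length is the one reported above, while the example sequences and labels are illustrative:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelBinarizer

# Tokenized sequences of different lengths (illustrative values between 1 and 64)
sequences = [[15, 40, 1], [22, 7, 63, 4, 18]]
max_length = 7680                      # length of the longest sequence in the dataset

# Zeros are appended at the end of each sequence up to the fixed size
X = pad_sequences(sequences, maxlen=max_length, padding="post", value=0)

# Class names become one-hot arrays; the position order is the fixed one above
labels = ["Resistance", "Promiscuity"]
binarizer = LabelBinarizer()
binarizer.fit(["Essential", "Promiscuity", "Resistance", "Virulence"])
y = binarizer.transform(labels)
print(X.shape, y)                      # (2, 7680) and [[0 0 1 0], [0 1 0 0]]
```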
Once everything was transformed, we proceeded to input 80% of total data to the CNN. The reason for this is that we want to leave a small set of data for evaluating the classification performance and that must be done on data that has not been used for training purposes.
Regarding the CNN architecture, summarized in Figure 9, its structure follows a standard construction. The first layer of every CNN that works with text needs to be an embedding layer, typically known as the first hidden layer of the network. Its main purpose is to convert each input (a number representing a word) into a vector of fixed size, as represented in Figure 7. The input dimension of this layer is the size of the vocabulary used; in our case, as explained earlier, there are 64 elements in our vocabulary, corresponding to the number of possible codons. The output dimension is 3 or 4 (depending on the version used), which is the aforementioned length of the vector for each word.
The main body of the CNN is made of functional blocks coupled in succession. Each block contains 3 different operations: a convolutional layer with a ReLU activation function, a MaxPooling layer, and a Dropout layer. The next layer is a Bidirectional layer, which processes the extracted feature sequence in both directions so that the output combines information from the whole sequence, generating a more complete output. Finally, we need a layer able to compress all the information extracted by the network into a format we can interpret. This was achieved by using a Dense layer with a softmax activation function. The Main Discriminator incorporated 5 functional blocks like the ones in Figure 9.
The training of the Main Discriminator was conducted for 15 epochs; the loss function was categorical cross-entropy, we selected 'adam' as our optimizer, and accuracy as the metric to display during training. Each training run lasted approximately 5 hours.
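As a hedged sketch, the architecture and training configuration described above could be written as follows, assuming a Keras/TensorFlow implementation; the number of filters, kernel sizes, dropout rate, and the LSTM wrapped by the Bidirectional layer are illustrative assumptions rather than our exact settings:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D, Dropout,
                                     Bidirectional, LSTM, Dense)

NUM_CLASSES = 4        # Essential, Promiscuity, Resistance, Virulence
MAX_LENGTH = 7680      # padded sequence length of the Main Discriminator

model = Sequential()
# Embedding: 64 codons (+1 for the padding token) -> small fixed-size vectors
model.add(Embedding(input_dim=65, output_dim=NUM_CLASSES))

# Functional blocks: Conv1D (ReLU) + MaxPooling + Dropout, repeated 5 times
for _ in range(5):
    model.add(Conv1D(filters=64, kernel_size=3, activation="relu", padding="same"))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.2))

# Bidirectional recurrent layer reading the feature sequence in both directions
model.add(Bidirectional(LSTM(32)))

# Output layer: one probability per class
model.add(Dense(NUM_CLASSES, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# X and y would be the padded, tokenized sequences and binarized labels prepared
# in the previous steps; random placeholders are used here.
X = np.random.randint(1, 65, size=(10, MAX_LENGTH))
y = np.eye(NUM_CLASSES)[np.random.randint(0, NUM_CLASSES, size=10)]
model.fit(X, y, epochs=15, verbose=0)
```

The vocabulary size, embedding output dimension, the 5 functional blocks, the softmax output layer, and the training settings follow the description above; everything else in the sketch is a placeholder.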
After the model is trained, we can produce a prediction with new sequences. This process outputs a 4-dimensional array where the position for each class contains the probability of belonging to the particular class. Using a threshold of 0.5, we consider any class with a higher probability to be the sequence’s label. Therefore, each sequence can only have one predicted label.
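The thresholding step itself can be sketched as follows (the probability values are made up for illustration):

```python
import numpy as np

CLASS_NAMES = ["Essential", "Promiscuity", "Resistance", "Virulence"]

# Illustrative output of the prediction step for two new sequences:
# one row of class probabilities per sequence (each row sums to 1).
probabilities = np.array([[0.05, 0.10, 0.80, 0.05],
                          [0.30, 0.25, 0.25, 0.20]])

for row in probabilities:
    above = np.where(row > 0.5)[0]
    if len(above) == 1:
        print(CLASS_NAMES[above[0]])   # exactly one class can exceed the 0.5 threshold
    else:
        print("no class above threshold")
```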
Gene Transfer Identification
Horizontal gene transfer is one of the mechanisms that bacteria have at their disposal to share genetic information between genomes. Nevertheless, this term covers a wide spectrum of more specific machinery designed for such purposes, including transformation, which results from the uptake and expression of foreign DNA, and conjugation, which consists of DNA transfer from a cell that acts as a donor to a recipient one.
Input Data
In order to train our CNN for this purpose, a good deal of data was required. This data was adapted from ICEberg, a database specialized in conjugative elements. The data downloaded from this website was encoded as proteins, which meant that an additional reverse-translation step had to be performed, since our purpose was to train our neural networks with sequences of codons. Since more than one codon can code for the same amino acid, a probabilistic approach was taken to obtain the sequences. Furthermore, since the labeling of this data was ambiguous for our objective, we adapted it to the categories of mechanisms that we considered important. Afterward, the data was organized in a CSV file following the structure of the data used for the main discriminator. The total number of elements in the input file was 3066 sequences with their corresponding labels. For the training of the CNN, 80% of the data was used for training, while the other 20% was employed for test purposes.
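A minimal sketch of the probabilistic reverse-translation step (plain Python; the abbreviated codon table and the uniform choice among synonymous codons are simplifying assumptions, as the exact codon-usage frequencies are not detailed here):

```python
import random

# Abbreviated reverse codon table: each amino acid maps to its synonymous codons
# (only a few amino acids are shown; the real table covers all twenty plus stop).
REVERSE_CODON_TABLE = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "*": ["TAA", "TAG", "TGA"],          # stop
}

def reverse_translate(protein, rng=random):
    """Pick one synonymous codon at random for each amino acid."""
    return "".join(rng.choice(REVERSE_CODON_TABLE[aa]) for aa in protein)

print(reverse_translate("MKL*"))   # e.g. ATGAAACTGTAA (varies between runs)
```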
Structure
The structure of the CNN employed for this training followed the same form as the one described for the main discriminator. Nevertheless, one change was made to adapt it to our specific case: the output dimensionality of the Embedding layer, previously set to 3 (or 4, depending on the version), was changed to 2. This is because our purpose here is to discriminate between 2 different classes, 'conjugation' and 'transformation'. The Promiscuity classifier incorporated 5 functional blocks like the ones in Figure 9.
Recognizing the Risk
Virulence factors define the ability of bacteria to infect and/or damage a host cell. These virulence factors are encoded as genes in bacterial genomes. Therefore, we believed that including a model for detecting and classifying these pathogenic elements could give robustness to the overall classification network. We decided to work with two types of virulence factors: toxins, which damage the host tissue and can disable the immune system; and adhesion factors, which allow pathogens to attach themselves to the host cell.
Input Data
The data set used to train the CNN was acquired from the Virulence Factor Database (VFDB). This database contains up-to-date information about pathogens, including a collection of genes that have an impact on the virulence properties of bacteria. In this case, the genetic sequences were directly available in the form of base pairs. In addition, every gene contained a brief description of the protein it encoded. Therefore, by manually checking the name and information of each gene, every sequence was provided with a class label indicating whether it was a toxin or an adhesion factor. Since the classification is binary, only two labels are needed: 'Toxin' and 'Adhesion'. In the same fashion as for the other classifiers, the data set was organized in a CSV file of 1729 sequences with their respective labels.
Structure
Regarding the architecture of this network, no changes were made with respect to the CNN used in the Promiscuity classifier, since the output dimensionality of the Embedding layer is also 2, given that a binary classification between the 'Toxin' and 'Adhesion' classes is being made. After encountering performance issues, the number of functional blocks for the Virulence network was changed from 3 to 5.
The Devil is in the Details
As we know, resistance stands for the ability of pathogens to become immune to certain antibiotics. That process is carried out using multiple resistance mechanisms, such as efflux pumps or alteration of the target, among others. Thanks to the literature [8][9], we know that certain combinations of mechanisms are used to develop resistance to certain antibiotics. Using this information, we created a dictionary where antibiotics are related to the mechanisms used to resist them (e.g., beta-lactams are resisted by hydrolysis, efflux, and altering the target). The full dictionary contains 9 total labels and can be seen in Figure 8.
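In code, the dictionary takes the following shape; only the beta-lactam entry given above is shown, since the remaining antibiotic families and their mechanism combinations are listed in Figure 8:

```python
# Antibiotic family -> resistance mechanisms used against it.
# Only the beta-lactam example from the text is shown here; the full dictionary
# covers 9 mechanism labels in total (see Figure 8).
ANTIBIOTIC_TO_MECHANISMS = {
    "beta-lactams": ["hydrolysis", "efflux", "altering the target"],
    # ... remaining antibiotic families from Figure 8 ...
}

# Converting an antibiotic label into its mechanism labels:
print(ANTIBIOTIC_TO_MECHANISMS["beta-lactams"])
```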
Input Data
All the data used for the resistance classifier came from PATRIC [1], which contains a wide variety of information regarding antimicrobial resistance and genomics from numerous bacteria. The database is organized by antibiotics and the genes that resist them. We performed an initial filtering to select only those antibiotics that had more than 100 resistance genes. We also discarded those genes whose length deviated from the mean length by more than two standard deviations.
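A hedged sketch of this filtering, assuming the downloaded data is loaded into a pandas DataFrame with hypothetical 'antibiotic' and 'sequence' columns and a hypothetical file name:

```python
import pandas as pd

# One row per gene; 'antibiotic' and 'sequence' are hypothetical column names
df = pd.read_csv("patric_genes.csv")   # hypothetical file name

# Keep only antibiotics with more than 100 resistance genes
counts = df["antibiotic"].value_counts()
df = df[df["antibiotic"].isin(counts[counts > 100].index)]

# Discard genes whose length deviates from the mean by more than two standard deviations
lengths = df["sequence"].str.len()
mean, std = lengths.mean(), lengths.std()
df = df[(lengths - mean).abs() <= 2 * std]
```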
Once that filtering was complete, we organized the data in the initial CSVs from the very first engineering cycle. From there, we iterated through all the files and converted each antibiotic name into its mechanisms using the dictionary, while leaving the actual sequence intact. That way, we had a whole CSV containing the resistance mechanisms of each sequence. The sequences were already in nucleotides, so no further pre-processing steps had to be made in that regard. The final CSV contained 4790 labeled sequences.
Structure
The Resistance classifier behaves differently from the ones seen until now. All the previous networks were single-label, meaning that each sequence could only belong to one class. Now, a sequence can fit multiple labels, turning the problem into a multi-class, multi-label classification.
In terms of the tokenization of sequences, there is no change: the same dictionary seen in the Main Discriminator is used to convert sequences into numbers. It is important that the dictionary is the same, given that the two classifiers have to be coupled.
The labels, however, must be treated slightly differently. Instead of a label binarizer we use a multi-label binarizer, which converts the labels into a 9-positional array: the positions corresponding to the labels of each particular sequence are set to one, while the others remain at 0.
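A short sketch of this step, assuming scikit-learn's MultiLabelBinarizer; the mechanism names are illustrative, and in practice the binarizer is fitted on all 9 labels from Figure 8:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each sequence now carries a set of mechanism labels instead of a single class
labels_per_sequence = [
    ["hydrolysis", "efflux", "altering the target"],   # e.g. a beta-lactam resistance gene
    ["efflux"],
]

mlb = MultiLabelBinarizer()                # fit on all 9 mechanism labels in practice
y = mlb.fit_transform(labels_per_sequence)
print(mlb.classes_)                        # the known labels, in alphabetical order
print(y)                                   # one row per sequence, 1s at its mechanisms
```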
Regarding the structure of the network, it has to be slightly modified. The block and layer organization remains the same, only changing the size of the label (output) array to 9. Nevertheless, multi-label problems require changes in the activation and loss functions. First, the activation of the Dense (output) layer was changed from softmax to a traditional sigmoid. Second, we changed the loss function from categorical cross-entropy to binary cross-entropy. With these changes, each label is treated individually, as if it were a binary problem between that class and the rest. This helped us boost the accuracy and precision metrics, since mistakes are penalized much more than with the original configuration.
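Concretely, under the same Keras assumptions as before, only the output layer and the compile step change with respect to the single-label networks; the abbreviated model below is a sketch, and matching the embedding output dimension to the 9 labels follows the same convention used in the other classifiers rather than a setting stated here:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

# Same stack of functional blocks as before (abbreviated here); only the
# output layer and the loss function change for the multi-label case.
model = Sequential([
    Embedding(input_dim=65, output_dim=9),   # output dim assumed to follow the 9 labels
    # ... the 5 Conv1D + MaxPooling1D + Dropout blocks would go here ...
    Bidirectional(LSTM(32)),
    Dense(9, activation="sigmoid"),           # 9 independent mechanism probabilities
])
model.compile(loss="binary_crossentropy",     # each label treated as a binary problem
              optimizer="adam", metrics=["accuracy"])
```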
Data Post Processing
Having the resistance mechanisms is very useful for the rest of the pipeline, but when it comes to assessing the results of the model, working with them directly is counterproductive. To solve this problem, we can use the same dictionary from Figure 8, this time reverse translating from the mechanisms found for each sequence back to the antibiotic name. Apart from facilitating the comparison of results (the original labels also contained the antibiotic name), it serves as a first quality check, given that for the reverse translation to be successful, the exact combination of labels for each antibiotic has to be predicted. If the system fails to produce these combinations, improvements to accuracy have to be introduced. That was not the case, however, since it only happened for 3 out of all the sequences.
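This reverse translation can be sketched by inverting the same (abbreviated, partly hypothetical) dictionary and requiring an exact match of the predicted mechanism combination:

```python
# Same antibiotic -> mechanisms dictionary as before (abbreviated)
ANTIBIOTIC_TO_MECHANISMS = {
    "beta-lactams": ["hydrolysis", "efflux", "altering the target"],
    # ... remaining antibiotic families from Figure 8 ...
}

# Invert it: a frozen set of mechanisms maps back to the antibiotic name
MECHANISMS_TO_ANTIBIOTIC = {frozenset(v): k for k, v in ANTIBIOTIC_TO_MECHANISMS.items()}

def to_antibiotic(predicted_mechanisms):
    """Return the antibiotic whose mechanism combination matches exactly, if any."""
    return MECHANISMS_TO_ANTIBIOTIC.get(frozenset(predicted_mechanisms))

print(to_antibiotic(["efflux", "hydrolysis", "altering the target"]))  # beta-lactams
print(to_antibiotic(["efflux"]))                                       # None: no exact match
```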
References
[1] Davis JJ, Wattam AR, Aziz RK, Brettin T, Butler R, Butler RM, Chlenski P, Conrad N, Dickerman A, Dietrich EM, Gabbard JL, Gerdes S, Guard A, Kenyon RW, Machi D, Mao C, Murphy-Olson D, Nguyen M, Nordberg EK, Olsen GJ, Olson RD, Overbeek JC, Overbeek R, Parrello B, Pusch GD, Shukla M, Thomas C, VanOeffelen M, Vonstein V, Warren AS, Xia F, Xie D, Yoo H, Stevens R. The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities. Nucleic Acids Res. 2020 Jan 8;48(D1):D606-D612 https://doi.org/10.1093/nar/gkz943
[2] Alcock et al. (2020). CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database Nucleic Acids Research, 48, D517-D525. https://doi.org/10.1093/nar/gkz935
[3] M. Liu, X. Li, Y. Xie, D. Bi, J. Sun, J. Li, C. Tai, Z. Deng, H.Y. Ou (2019). ICEberg 2.0: an updated database of bacterial integrative and conjugative elements. Nucleic Acids Research, https://doi.org/10.1093/nar/gky1123
[4] Hao Luo, Yan Lin, Feng Gao, Chun-Ting Zhang and Ren Zhang. (2014). DEG 10, an update of the Database of Essential Genes that includes both protein-coding genes and non-coding genomic elements. Nucleic Acids Research, 42, D574-D580. http://origin.tubic.org/deg/public/index.php
[5] VFDB: Virulence Factor Database. http://www.mgc.ac.cn/VFs
[6] GenBank. https://www.ncbi.nlm.nih.gov/genbank
[7] National Center for Biotechnology Information (NCBI). https://www.ncbi.nlm.nih.gov
[8] Reygaert W. C. (2018). An overview of the antimicrobial resistance mechanisms of bacteria. AIMS microbiology, 4(3), 482–501. https://doi.org/10.3934/microbiol.2018.3.482
[9] Davies, J., & Davies, D. (2010). Origins and evolution of antibiotic resistance. Microbiology and molecular biology reviews : MMBR, 74(3), 417–433. https://doi.org/10.1128/MMBR.00016-10