Team:UPF Barcelona/Alpha results

Team:UPF Barcelona - 2021.igem.org

Results

AlphaNeuro

Performance metrics definition and confusion matrices for the AlphaNeuro module.


Definitions

To evaluate the performance of the Neural hallmark classifier, we assessed each of the network’s stages using a test set. The test set is composed of sequences that were not used during training, so the weights of each layer are not influenced by the information contained in those sequences. Since each of the CNNs constituting the Neural hallmark classifier is trained on different data, each test set is composed of sequences matching the network it evaluates.

The general validation process starts by feeding each test set to its CNN and generating a prediction for every sequence in the set. Since we know the real class of each sequence, we then check whether each output matches its true label in order to evaluate how well the system predicts unseen data. Different metrics can then be calculated to assess the rate of success. We mainly used accuracy as the validation metric, since it gives a general description of how the model performs across the different classes. It is calculated as the ratio between the number of correct predictions and the total number of predictions.

**Figure 1:** Traditional Accuracy formula.
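For reference, the ratio described above can be written out as follows. This is the standard textbook definition; the wording of the terms is our own and is not copied from the original figure.

```latex
\[
\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
\]
```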

The other metrics used to assess performance were recall and precision. To calculate them, we must understand that a True Positive is a correct classification of a Positive label. In our case, if a sequence is predicted to be a ‘Toxin’ and its real label is ‘Toxin’, it is a True Positive. A False Positive would be a sequence predicted to be a ‘Toxin’ whose true label is ‘Adhesion’. Similarly, if a sequence is predicted to be something other than ‘Toxin’ but its true label is ‘Toxin’, that constitutes a False Negative. Finally, a True Negative arises when both the prediction and the label are anything other than ‘Toxin’. Knowing this, the formulas to calculate recall and precision are the following:

**Figure 2:** Traditional Recall formula.

**Figure 3:** Traditional Precision formula.
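The definitions above can be sketched in a few lines of Python. In the standard formulation, recall is TP / (TP + FN) and precision is TP / (TP + FP). The function names and the toy labels below are invented for illustration and are not part of the AlphaNeuro code or data.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, FN and TN, treating `positive` as the Positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted Positive, truly Positive
        elif pred == positive:
            fp += 1  # predicted Positive, truly something else
        elif truth == positive:
            fn += 1  # predicted something else, truly Positive
        else:
            tn += 1  # neither prediction nor label is Positive
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Return (precision, recall); 0.0 is used when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, invented for illustration only.
y_true = ["Toxin", "Toxin", "Toxin", "Adhesion", "Adhesion", "Adhesion"]
y_pred = ["Toxin", "Toxin", "Adhesion", "Toxin", "Adhesion", "Adhesion"]

tp, fp, fn, tn = one_vs_rest_counts(y_true, y_pred, positive="Toxin")
precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, fn, tn, precision, recall, accuracy)
```

For a multi-class model, the same counting can be repeated with each class in turn taken as the Positive label.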

The results of this section are presented using confusion matrices. This type of plot allows a quick analysis of the predictions by displaying, in a table, the normalized number of correct and misclassified outputs. The true label is found on the vertical axis, while the predicted label is found on the horizontal one. Therefore, the main diagonal holds the number of correctly predicted sequences for each class, and the cells outside the diagonal hold the number of confused predictions. We also include the accuracy value of each model.

**Figure 4:** Representation of structure of confusion matrix.
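As a minimal sketch of this layout, the snippet below builds a confusion matrix with true labels on the rows and predictions on the columns, then normalizes each row. We assume here that "normalized" means dividing by the number of true sequences in each class, in which case the diagonal equals the per-class recall. The helper names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels (raw counts)."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        counts[index[truth]][index[pred]] += 1
    return counts


def normalize_rows(counts):
    """Divide each row by its total so every non-empty row sums to 1."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy data, invented for illustration only.
labels = ["Resistance", "Promiscuity", "Virulence"]
y_true = ["Resistance"] * 4 + ["Promiscuity"] * 2 + ["Virulence"] * 2
y_pred = ["Resistance", "Resistance", "Resistance", "Virulence",
          "Promiscuity", "Promiscuity",
          "Virulence", "Resistance"]

counts = confusion_matrix(y_true, y_pred, labels)
matrix = normalize_rows(counts)

# With row normalization, the diagonal is the recall of each class.
per_class_recall = [matrix[i][i] for i in range(len(labels))]
# Accuracy uses the raw counts: correct predictions over all predictions.
accuracy = sum(counts[i][i] for i in range(len(labels))) / len(y_true)
print(per_class_recall, accuracy)
```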

Main Discriminator

The Main Discriminator is the first layer of classification inside AlphaNeuro and discriminates the principal features of the sequence, so its correct performance is essential for the whole pipeline of the project. As explained in the Software section of AlphaNeuro, there are 2 versions of the Main Discriminator, reflecting a trade-off between adding a "False" class for a more complete analysis and obtaining better overall performance metrics. Since there was no clear reason to choose one over the other, both versions were explored.

In Figure 1 we can observe the results obtained for the 3-class version of the Main Discriminator. The 3 feature classes have recall, precision and accuracy scores higher than 90%, which can be considered good for a classification problem. Regarding the heterogeneity of the data, we can conclude that it didn't play a major role. As explained in the Software section, Resistance is the biggest class with 4791 sequences, followed by Promiscuity with 3068 sequences and finally Virulence with 1729. However, in the results the metrics are extremely close to one another (0.93, 0.92 and 0.99 respectively), with Promiscuity spiking to a near-perfect scenario. This can be explained by the overall complexity of the process and of the patterns that the network is finding. Since we are using the same parameters for the 3 classes, one reasonable explanation is that Promiscuity traits are easier to recognize in DNA than the other 2. Looking at the misclassifications, the biggest are between Resistance and Virulence, which is important to remember since this confusion is amplified in the other version of the network.

Overall, this version of the Main Discriminator can be considered a great first line of classification, given that the error carried over to later models is considerably small. Despite that, a big flaw arises when using this configuration. If a sequence that does not belong to any of the proposed classes is entered into the classifier, it is likely to be assigned to one of them, which is a mistake. Several tests showed that unrelated sequences tended to be misclassified, most often as Virulence or Promiscuity.

To account for this problem, the 4-class version of the Main Discriminator was created by adding a "False" class consisting of bacterial essential genes:

**Figure 2:** Confusion matrix for the Main Discriminator (4-class model). The metric shown in the cells is the normalized recall.

As we can see in Figure 2, the cost of having a False class is a drop in the performance metrics of the other classes. While Promiscuity and Resistance either remain unchanged or fluctuate slightly, Virulence suffers a sudden drop in all the considered metrics. The reason is that the confusion between Virulence and the previous classes is amplified up to 0.17, and there is a large misclassification involving the new False class, which goes both ways. This error can be explained from two perspectives (or a combination of the two). On the one hand, the heterogeneity of the data might matter more in this situation than in the previous model (the number of sequences for each class remains the same, with the addition of 3000 essential genes), creating an imbalance that does not favor the correct classification of Virulence.

On the other hand, there could be a similarity between the sequence patterns of the essential genes and those of the Virulence genes, which would cause the misclassification.

Despite these performance issues, the Essential class has a notably good recall of 0.85, which means that unrelated genes will now mostly be kept out of the other classes that are analyzed later. That is why the 4-class version might be the better option for the future: preventing False Positives is a very important feature of a great classification system. With the right improvements in parameters and data, we could potentially reach metrics similar to those of the prior model while also having the certainty that the shield against non-interesting genes is up.

Promiscuity model

**Figure 3:** Confusion matrix for the promiscuity classifier featuring the normalized recall.

The metric shown in the confusion matrix is recall, which gives the ratio of correctly identified labels to the total number of sequences with that label. As we can observe in Figure 3, the overall recall of the promiscuity classifier is near perfect: it correctly classifies 99% of the ‘Transformation’ sequences and 95% of the ‘Conjugation’ sequences. With this we can affirm that, even though we are employing heterogeneous data to build a multi-class classifier, the CNN is capable of identifying repetitive sequences and similar structures among sequences from different bacteria that share the same label, which allows it to differentiate them from the other sequences. This is one of the main strengths of neural networks: they can identify underlying patterns that could otherwise not be identified by humans.

Nevertheless, other metrics were employed to further test this model. The accuracy on the test set was 98%, calculated with the formula given in the Definitions section. Furthermore, precision was also evaluated, which is the ratio of correctly identified labels to the total number of elements predicted to have that label, both correctly and incorrectly.

Virulence model

**Figure 4:** Confusion matrix for the virulence classifier featuring the normalized recall.

As can be observed in the confusion matrix shown in Figure 4, the virulence classifier does not display an accuracy as high as the other models. Although the ‘Adhesion’ sequences are mostly well classified, the network is not able to achieve a rate of success higher than 59% for the ‘Toxin’ labeled sequences. This could be due to the high heterogeneity among the toxin genes, as many of them are unique and the model is not able to distinguish a clear genetic pattern among toxin sequences. On the other hand, the adhesion genes, most of which encode proteins forming pili, fimbriae, flagella or adhesion factors, share mutual information in their base pair sequences.

Therefore, we can conclude that the adhesion genes have architectures that are conserved among different bacteria and structures.

A possible approach for improving the accuracy and performance of the virulence classifier is to use a larger and better curated data set. This way, the system could be trained with enough sequences to recognize deeper patterns and produce more reliable outputs.

Resistance model

As explained in the Software section, the Resistance classifier treats each class individually. In terms of results, this means that a binary confusion matrix is generated to assess each label. Since checking 9 matrices individually would not give relevant information on the type of antibiotic each sequence resists, we implemented a postprocessing step (see the Software section of AlphaNeuro). Thanks to it, we can directly assess the metrics for each antibiotic instead of for each mechanism, which yields the 9x9 matrix in Figure 5. As in the other classifiers, the metric displayed is recall.

**Figure 5:** Resistance confusion matrix for all antibiotics featuring the normalized recall.

The total number of sequences in the test set was 958. Due to the limitations of the data, we were not able to create a balanced distribution of sequences across all classes. For instance, Rifampin only has 12 sequences in the test set while Betalactams has 618. This difference makes it difficult for the network to correctly predict the minority class (as we can see in Figure 5, Rifampin has a recall of 0.67).

The accuracy across the whole classification was 0.97, although it is also affected by the imbalance of the data (positively in this case, since the most frequent classes have better metrics). To make the classification more accurate, much more balanced data is needed, that is, the same number of sequences for all antibiotics and resistance mechanisms.
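To illustrate why imbalance pushes the accuracy up, note that overall accuracy is the average of the per-class recalls weighted by the number of test sequences in each class. The sketch below uses the figures given in the text for Rifampin (12 sequences, recall 0.67, which is consistent with 8 of 12 being correct) and the test set size (958). The single "other" group with a recall of 0.97 is a hypothetical simplification, not a reported result, and the function names are our own.

```python
def overall_accuracy(recalls, supports):
    """Support-weighted mean of per-class recall, equal to overall accuracy."""
    correct = sum(r * n for r, n in zip(recalls, supports))
    return correct / sum(supports)


def macro_recall(recalls):
    """Unweighted mean of per-class recall; every class counts equally."""
    return sum(recalls) / len(recalls)


# From the text: 958 test sequences in total, 12 of them Rifampin.
# Hypothetical: the remaining 946 sequences are lumped into one group
# with an assumed recall of 0.97, purely for illustration.
supports = [12, 958 - 12]
recalls = [8 / 12, 0.97]

accuracy = overall_accuracy(recalls, supports)
balanced = macro_recall(recalls)
print(round(accuracy, 3), round(balanced, 3))
```

In this toy setting the small class barely moves the overall accuracy but pulls the unweighted mean down noticeably, which is why reporting per-class recall alongside accuracy is useful for imbalanced test sets.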

Model improvement

Despite the good results, it is important to keep in mind that the classifiers are not in a final state at all. Initially, we started with binary classifiers for each antibiotic, completely independent from one another. However, those systems were not scalable and had accuracy problems. This led us to the current version, which puts the mechanisms together and adds other important factors in virulence and promiscuity. In addition, we already have in mind how to obtain the third version of AlphaNeuro.

Other than the aforementioned data problem, which can be solved by finding other sources of more and higher-quality data, we still need to add some factors that were left out. In the Promiscuity classifier, for instance, the transduction mechanism was disregarded due to poor performance. In the future we plan to explore and apply parameter tuning algorithms to find the right configuration of learning rate, functions, epochs, etc. This would allow us to optimize the systems even further.

In the Resistance classifier in particular, we plan to add another sub-network that discriminates between the mechanisms of antibiotic target alteration used by resistant pathogens, adding more variability to the mix and therefore providing a more complete analysis of the particular mechanism each gene or sequence uses to cause trouble in healthcare. Additionally, more sub-networks are planned for those antibiotics that form big families, such as betalactams. Different experts have taught us that some families have sub-groups that behave very differently from one another. With the right data, we could create specific classifiers for the sub-groups inside a family of antibiotics and deal with the problem at a finer level of detail.

Finally, along the same line of creating sub-systems, we plan to add more variation to the False class of the Main Discriminator, that is, other types of genes that improve the variability of the sequences stored there, creating a more robust check for genes that are of no interest to our problem.
