Software

Disease state prediction using 16S rRNA from gut microbiome

Abstract

A logistic regression model helps to understand the influence that different features have on a target variable [1]. In our case, this is the influence of the composition of the microbiome on the health status of the patients. The calculation resembles that of a linear regression model, the main difference being that the outcome predicted by a logistic regression model is binomial rather than continuous [1].

Furthermore, a Graph Attention Network has been implemented. Based on the principal components of the microbiome composition (features) of each patient, a kNN graph is generated. The Graph Attention Network consists of two graph convolutional layers with self-attention and predicts whether the composition of a patient’s microbiome indicates a diseased or a healthy status.

Background

A healthy reference microbiome composition is not known [2]. Certain diseases have been linked to the depletion or increased occurrence of typical microorganisms [2], [3]. However, the interplay and interdependencies between the microorganisms of the microbiome are not well understood.
Machine learning, and deep learning in particular, has been shown to extract hidden, possibly non-linear relations between the features and the prediction [4]. For instance, deep learning algorithms have been used to predict phenotypes and disease states from genotype markers and SNPs [5]. Nonetheless, deep learning models are difficult to interpret [6]. Linear models, in contrast, are inherently interpretable, because their weights can be extracted and examined directly without relying on post-hoc explanation methods.
This work aims to use machine learning and deep learning to predict a disease state from the microbial composition and to provide interpretability regarding the importance of individual microorganisms for the prediction.
In this project, we use 16S rRNA data to model the composition of the microbiome, because 16S rRNA data can distinguish organisms down to the bacterial strain level [7], [8]. This makes it well suited for analysing which bacterial strains differ in abundance between diseased and healthy patients.

Data preprocessing

First, the different steps to acquire the data frame are briefly explained. We used the fasterq-dump function from the NCBI SRA-tools to extract the FASTQ files of the GSE projects mentioned above. To determine the microbial composition, we used the tool kraken2 [9], which assigns the sequences in the FASTQ files to microbial genomes and thereby counts the abundance of each taxon in the microbiome. This data was then used to construct a data frame containing the absolute number of counted sequences for each bacterial strain and each patient. We only managed to find 16S rRNA data for 37 diseased patients, 19 of whom were adults. We discarded the adults, since their microbiome differs fundamentally from that of children, who are the focus of this project. We suspect that the reason for this scarcity is simply that the disease is so specific that little sequencing effort has been put into this question. In total we had access to the data of only 18 diseased children, whereas sequencing data for healthy patients was abundant and could be accessed through the HMP database. We tried to ensure that patient characteristics such as age were largely comparable between the groups. Since we were limited to the data of 18 diseased patients, we added only 97 healthy patients to our dataset. Although the ratio between healthy and diseased patients is not balanced, a machine learning model should still be capable of extracting the differences between healthy and diseased people.
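A minimal sketch of this acquisition step, assuming SRA-tools and kraken2 are installed, a kraken2 database has already been built, and the runs are single-end; the accession IDs, directory names and database path are placeholders rather than the exact values used in this project:

```python
import subprocess
from pathlib import Path

# Placeholder accession list and paths; adjust to the actual SRA runs and kraken2 database.
ACCESSIONS = ["SRR0000001", "SRR0000002"]
FASTQ_DIR = Path("fastq")
REPORT_DIR = Path("kraken2_reports")
KRAKEN2_DB = Path("kraken2_db")  # pre-built kraken2 database

FASTQ_DIR.mkdir(exist_ok=True)
REPORT_DIR.mkdir(exist_ok=True)

for acc in ACCESSIONS:
    # Download the run as a FASTQ file with SRA-tools.
    subprocess.run(["fasterq-dump", acc, "-O", str(FASTQ_DIR)], check=True)

    # Classify the reads with kraken2; the report lists read counts per taxon.
    subprocess.run([
        "kraken2",
        "--db", str(KRAKEN2_DB),
        "--report", str(REPORT_DIR / f"{acc}.report"),
        "--output", str(REPORT_DIR / f"{acc}.out"),
        str(FASTQ_DIR / f"{acc}.fastq"),
    ], check=True)
```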

Since the data was extracted from different projects, not all datasets include the same bacterial strains. To address this problem, we first calculated the relative abundance of each bacterial strain for each patient. Thereafter, we set a threshold and filtered out all bacterial strains whose column contained more zero entries than the threshold allows, since such a column would hold little information or would be judged overly important by the model simply because, for example, the diseased patients had no entries for this bacterium. As we cannot rule out the possibility that such zeros are simply due to the sequencing methods not being entirely identical, we had to remove these bacteria.
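A minimal sketch of this normalization and filtering step, assuming the counts are available as a patients-by-taxa table; the file name and the zero-fraction threshold are placeholders, not the values used for the actual analysis:

```python
import pandas as pd

# counts: rows = patients, columns = bacterial taxa, values = absolute read counts
# (assumed layout, built from the kraken2 reports).
counts = pd.read_csv("abundance_counts.csv", index_col=0)

# Convert absolute counts to relative abundances per patient.
rel_abundance = counts.div(counts.sum(axis=1), axis=0)

# Drop taxa whose column contains more zeros than the chosen threshold allows.
MAX_ZERO_FRACTION = 0.5  # illustrative value only
zero_fraction = (rel_abundance == 0).mean(axis=0)
filtered = rel_abundance.loc[:, zero_fraction <= MAX_ZERO_FRACTION]
```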

Figure 1: Extraction of the data. Created with BioRender.com

Logistic Regression

Results

Our analysis resulted in a model with an AUROC of 0.97, which would normally be an exceptionally good result. In our case, however, one could argue that the small amount of data is the reason for this score. Another reason could be missing relevant variability in the data (see Fig. 2), which may stem from the selection of data from two different projects, since no other data was available. The most striking feature supporting this interpretation is the unusual shape of the ROC curve, which essentially shows the classification performance of the model at different decision thresholds.

Figure 2: ROC curve of the logistic regression model.
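A minimal sketch of how such a model and its ROC curve could be produced with scikit-learn, building on the filtered table from the preprocessing sketch; the label file, the train/test split and the solver settings are assumptions, not the exact configuration used for Figure 2:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve

# "filtered" is the relative-abundance table from the preprocessing sketch;
# labels.csv is a hypothetical file with 1 = diseased, 0 = healthy per patient.
labels = pd.read_csv("labels.csv", index_col=0)["diseased"]

X_train, X_test, y_train, y_test = train_test_split(
    filtered, labels, test_size=0.3, stratify=labels, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted probabilities for the "diseased" class give the ROC curve and the AUROC.
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUROC:", roc_auc_score(y_test, probs))

# The learned weights can be inspected directly for interpretability.
weights = pd.Series(model.coef_[0], index=filtered.columns).sort_values()
```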

Our analysis showed that only four of the remaining filtered bacteria had a higher relative abundance in the diseased patients than in the healthy patients. Of those, the one that stood out the most was Fusobacterium prausnitzii. Being gram-negative increases its chance of also expressing type-4 pili, and thus of being able to undergo natural transformation, although some gram-positive bacteria express type-4 pili as well. We did not find published reports that explicitly prove our point, but several papers state that the fusobacteria group in general harbors a relatively large number of bacteria that express type-4 pili [10], [11], [12].

The SHAP analysis was performed to find the most relevant features, which in our case are the bacterial strains. Another important step was the comparison of the relevance values determined with SHAP with the weights of the logistic regression. The SHAP values indicate the importance of each bacterial strain: the higher up a strain is ranked, the more important it most likely is for determining whether a patient is diseased or healthy. The Fusobacterium prausnitzii strain was the only one that both has the capability to express type-4 pili and shows a higher abundance in diseased than in healthy patients. Because its SHAP value was not particularly high compared to other bacterial strains, the abundance does not seem to differ much between diseased and healthy patients, as also shown in Figure 3. The same conclusion follows from calculating the means of the relative abundances separately for diseased and healthy patients.
Nevertheless, it might still be a trend specific to diseased patients that is not yet recognizable in our analysis due to the lack of data.

Figure 3: SHAP values of bacterial strains. 171549: Bacteroidales, 816: Bacteroides, 213810: Ruminococcus champanellensis, 1783272: Terrabacteria group, 200643: Bacteroidia, 2: Bacteria, 976: Bacteroidetes, 186802: Eubacteriales, 853: Faecalibacterium prausnitzii
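A hedged sketch of how the SHAP values for the logistic regression could be computed and ranked, reusing the model and split from the previous sketch; the choice of the training set as background data is an assumption:

```python
import shap

# Explain the logistic regression with a linear explainer; the SHAP values can then
# be compared with the raw model weights.
explainer = shap.LinearExplainer(model, X_train)
shap_values = explainer.shap_values(X_test)

# Summary plot: taxa are ranked by their mean absolute SHAP value (as in Figure 3).
shap.summary_plot(shap_values, X_test, feature_names=list(filtered.columns))
```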

Discussion

While we paid special attention to whether the same sequencing methods and the same assay type were used, we did not manage to find data for healthy patients generated with the same sequencing protocol as for our diseased patients. Nevertheless, we tried to minimize the differences in the data caused by the different sequencing protocols; the use of different sequencing machines for the healthy and the diseased samples, for example, can cause such differences. We found a paper stating that these differences are not that large and that the ones that were found could stem from errors in the PCR or other mistakes made in the lab. Even so, several small errors of this scale can quickly result in a loss of data, since unclassified sequences were not considered, as there was no way to assign them [13].

However, not only the loss of data was a rather big issue, but also the small number of samples and patients in general.
We could not observe any change in the prediction performance after filtering the data. We suspect that some properties that would have been important for our analysis were not documented in the sequencing protocol or its overview on NCBI.

While the mean relative abundance of this bacterial strain was higher in diseased patients than in healthy patients, the logistic regression weights and SHAP values clearly indicated that the difference was so small that it had no impact on the decision whether a patient is healthy or diseased. As abundances tend to stay the same when something is essential for overall survival (analogous to housekeeping genes), we searched for the importance of this strain in the human gut and found several papers that mention its importance already in the title. As this bacterial strain is so important that it is needed in the human gut, it may even have a natural selection advantage after taking up the plasmid, so that the missing difference in abundance is not that grave but rather a blessing in disguise.

Correlation Features

The correlation between some of the features is quite high, meaning that the abundances of some bacterial strains are associated with each other. Since some of the features correspond to taxa as large as the Terrabacteria group, this does not come as a surprise. Even though it is not surprising, it shows that the interpretability of the SHAP values and of the weights calculated by the logistic regression is considerably impaired.
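These feature correlations can be inspected directly on the relative abundance table, for example as follows; the 0.8 threshold is purely illustrative:

```python
import numpy as np

# Pairwise Pearson correlation between the relative abundances of the remaining taxa
# ("filtered" is the table from the preprocessing sketch).
corr = filtered.corr()

# Keep only the upper triangle and report strongly correlated feature pairs.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8])
```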

Graph Neural Network

Biological data of different types can often be displayed as a network. A graph is composed of an adjacency matrix A, which defines the nodes and edges and, if applicable, the edge weights, a matrix XN of node features (weights, degrees, categorical attributes, …) and a matrix XE of edge features (categorical attributes, …) [14]. Machine learning methods, like graph kernels and deep learning, can be applied to graphs. These methods fulfil different tasks such as graph similarity measurement, graph classification or node classification.

In this project, we are dealing with an inductive node classification problem. We train our Graph Neural Network on a microbiome composition dataset. The evaluation is performed on a test dataset that the network has not seen before. For each dataset, we generate a separate kNN graph after calculating the principal components, as displayed in the figure below:

Figure 4: Mode of operation of GNNs. Created with BioRender.com

For the calculation of the principal components and the kNN graph, the tools implemented in scikit-learn are used [15]. Each node is characterized by its features (the microbial composition) and is associated with a ground-truth label (healthy, diseased). The node classification task consists of assigning a label to each node. A graphical overview of the GNN and its node classification is provided in the figure below:

Figure 5: Creation of a GNN. Created with BioRender.com
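The PCA and kNN-graph construction described above can be sketched with scikit-learn roughly as follows; the number of principal components (nine, matching the node features mentioned in the discussion) and the number of neighbours are assumptions rather than the exact settings used:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

# train_features: relative abundance matrix of the training patients (rows = patients);
# here it simply reuses the filtered table from the preprocessing sketch.
train_features = np.asarray(filtered)

# Reduce the microbial composition to a few principal components (node features).
pca = PCA(n_components=9)
node_features = pca.fit_transform(train_features)

# kNN graph over patients: each node is one patient, connected to its k nearest neighbours.
adjacency = kneighbors_graph(node_features, n_neighbors=3, mode="connectivity")

# Edge list in the (source, target) format expected by most GNN libraries.
edge_index = np.vstack(adjacency.nonzero())
```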

In a process called embedding, the encoder maps each node of the graph to a low-dimensional vector $z_i$:
$$\mathrm{ENCODER}(\mathrm{node}_i) = z_i$$
In a graph neural network, various node embedding schemes can be applied. In the case of the proposed neural network, the node embedding is based on local neighbourhoods.

The neural network learns by backpropagating the loss through the layers of the network. The loss is calculated by a criterion that compares the ground-truth label with the predicted label [16]. In each graph neural network layer, the node and edge representations are calculated through recursive neighborhood diffusion (message passing). For this, the features of the neighbors of a node are aggregated [16]. In a graph convolutional network, the aggregation is isotropic (each neighboring node contributes equally to the node that is updated) [17]. Like all graph attention networks, this model allows assigning different weights to the nodes in the neighborhood aggregation process (anisotropic operation), which is the major improvement over a GCN [17], [18], [19]. In the isotropic case, the embedding of a graph node v at layer k can be calculated using the following formula:

$$h_v^{(k)} = \sigma\left( W_k \sum_{u \in N(v) \cup \{v\}} \frac{h_u^{(k-1)}}{\sqrt{|N(u)|\,|N(v)|}} \right)$$

The formula takes the previous-layer embeddings of the neighbors of v as well as the previous-layer embedding of v itself into account. Our neural network consists of two layers and eight attention heads. The activation function σ is the exponential linear unit (ELU) non-linearity. On the second layer, a softmax activation is applied to obtain the binary classification. We apply L2 regularization with λ = 0.0005 and dropout with p = 0.6. We train our model for 100 epochs with a batch size of 2. The proposed neural network adapts the concept of Ravindra et al., 2020 [19].
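A minimal sketch of the described architecture, assuming PyTorch Geometric as the GNN library; the hidden layer width, the learning rate and the head configuration of the second layer are assumptions, not necessarily the settings used here:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=8, num_classes=2, heads=8, dropout=0.6):
        super().__init__()
        self.dropout = dropout
        # First GAT layer: eight attention heads, outputs are concatenated.
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads, dropout=dropout)
        # Second GAT layer: a single averaged head produces the class scores (assumption).
        self.conv2 = GATConv(hidden_dim * heads, num_classes, heads=1,
                             concat=False, dropout=dropout)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = F.elu(self.conv1(x, edge_index))   # ELU non-linearity
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)         # softmax over the two classes

model = GAT(in_dim=9)
# L2 regularization is applied through the weight decay of the optimizer (λ = 0.0005).
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)

def train_step(x, edge_index, y):
    """One training step on a kNN graph; repeated over the epochs."""
    model.train()
    optimizer.zero_grad()
    out = model(x, edge_index)
    loss = F.nll_loss(out, y)  # criterion comparing prediction and ground-truth label
    loss.backward()
    optimizer.step()
    return loss.item()
```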

Results

A UMAP (Uniform Manifold Approximation and Projection) has been performed on the training data and the test data before the PCA and kNN calculation, as can be seen in the figures below:

Figure 6: Training data UMAP.

Figure 7: Test data UMAP.

Through dimensionality reduction, healthy and diseased patients can be separated. However, the small number of patients only gives a sparse impression of the differences between healthy and diseased patients. The training and test datasets are shuffled and split so that diseased patients are equally distributed, which can be verified with the UMAP.
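A projection like the ones in Figures 6 and 7 could be produced with umap-learn roughly as follows; the UMAP parameters and the variable names (train_features, train_labels) are placeholders:

```python
import umap
import matplotlib.pyplot as plt

# 2D UMAP embedding of the relative abundance matrix (before PCA and kNN calculation).
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(train_features)

# Colour the patients by their ground-truth label (0 = healthy, 1 = diseased).
plt.scatter(embedding[:, 0], embedding[:, 1], c=train_labels, cmap="coolwarm")
plt.title("Training data UMAP")
plt.show()
```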

On the test dataset we reach a ROC score of 0.85, as can be seen in the figure below:

Figure 8: ROC curve for GNN.

The accuracy and ROC scores on the training dataset were on average 8 % higher than the accuracy and ROC scores on the test dataset. This behaviour indicates overfitting.

Discussion

To the best of our knowledge, no iGEM team has presented a graph neural network yet. Therefore, we believe that our contribution is a first step in the application of graph neural networks to biological data within the iGEM competition. The overall ROC score achieved with the GNN is lower than the overall ROC score achieved with the logistic regression. Although hyperparameter optimization has been applied to this model, a different setup could help to improve the ROC score of the GNN.

Ravindra et al., 2020 use single-cell RNA-seq datasets in their publication. By creating kNN graphs on single-cell RNA-seq matrices, larger graphs of up to 5,000 cells are created, and each single cell is annotated with around 20,000 features [19]. In comparison, the graphs we created consisted of only 11 patients with 9 features each. Therefore, the full potential of GNNs may only be reached with larger kNN graphs and a more comprehensive node feature annotation.

The initialization of the neural network seems to have a non-negligible effect on the resulting ROC curve. Running the neural network with the same hyperparameters led to large variations in the FPR, TPR, accuracy and ROC scores. We expect this effect to diminish when more data is used, as the training runs become longer.

As described in the results, the accuracy and ROC scores on the training dataset deviated from those on the test dataset. We increased the L2 regularization and weight decay to prevent overfitting, but could not solve this problem. We therefore identified the small batch size and the small size of the training and test datasets as the main reasons for the overfitting of the model.

Also, the graph neural network cannot be interpreted like the logistic regression. Newer tools, such as the GNNExplainer, offer an interpretation of the attention heads of graph attention networks [20]. However, these tools cannot be applied reliably when features are correlated, which can be expected in this dataset, as already described in the discussion of the logistic regression model.

Fork us on GitHub!

References:

[1] Sperandei S. (2014). Understanding logistic regression analysis. Biochemia medica, 24(1), 12–18. https://doi.org/10.11613/BM.2014.003

[2] Lloyd-Price, J., Abu-Ali, G., & Huttenhower, C. (2016). The healthy human microbiome. Genome medicine, 8(1), 51. https://doi.org/10.1186/s13073-016-0307-y

[3] Lopez-Siles, M., Martinez-Medina, M., Busquets, D., Sabat-Mir, M., Duncan, S. H., Flint, H. J., Aldeguer, X., & Garcia-Gil, L. J. (2014). Mucosa-associated Faecalibacterium prausnitzii and Escherichia coli co-abundance can distinguish Irritable Bowel Syndrome and Inflammatory Bowel Disease phenotypes. International journal of medical microbiology : IJMM, 304(3-4), 464–475. https://doi.org/10.1016/j.ijmm.2014.02.009

[4] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539

[5] Liu, Y., Wang, D., He, F., Wang, J., Joshi, T., & Xu, D. (2019). Phenotype Prediction and Genome-Wide Association Study Using Deep Convolutional Neural Network of Soybean. Frontiers in genetics, 10, 1091. https://doi.org/10.3389/fgene.2019.01091

[6] Lipton, Z.C. (2018). The mythos of model interpretability. Communications of the ACM, 61, 36 - 43.

[7] Osman, M. A., Neoh, H. M., Ab Mutalib, N. S., Chin, S. F., & Jamal, R. (2018). 16S rRNA Gene Sequencing for Deciphering the Colorectal Cancer Gut Microbiome: Current Protocols and Workflows. Frontiers in microbiology, 9, 767. https://doi.org/10.3389/fmicb.2018.00767

[8] Johnson, J. S., Spakowicz, D. J., Hong, B. Y., Petersen, L. M., Demkowicz, P., Chen, L., Leopold, S. R., Hanson, B. M., Agresta, H. O., Gerstein, M., Sodergren, E., & Weinstock, G. M. (2019). Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nature communications, 10(1), 5029. https://doi.org/10.1038/s41467-019-13036-1

[9] Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome biology, 20(1), 257. https://doi.org/10.1186/s13059-019-1891-0

[10] Suau, A., Rochet, V., Sghir, A., Gramet, G., Brewaeys, S., Sutren, M., Rigottier-Gois, L., & Doré, J. (2001). Fusobacterium prausnitzii and related species represent a dominant group within the human fecal flora. Systematic and applied microbiology, 24(1), 139–145. https://doi.org/10.1078/0723-2020-0001

[11] Olsen, I. (2014). The Family Fusobacteriaceae. https://doi.org/10.1007/978-3-642-30120-9_213

[12] Ligthart, K., Belzer, C., de Vos, W. M., & Tytgat, H. (2020). Bridging Bacteria and the Gut: Functional Aspects of Type IV Pili. Trends in microbiology, 28(5), 340–348. https://doi.org/10.1016/j.tim.2020.02.003

[13] Ranjan, R., Rani, A., Metwally, A., McGee, H. S., & Perkins, D. L. (2016). Analysis of the microbiome: Advantages of whole genome shotgun versus 16S amplicon sequencing. Biochemical and biophysical research communications, 469(4), 967–977. https://doi.org/10.1016/j.bbrc.2015.12.083

[14] Muzio, G., et al. (2021). "Biological network analysis with deep learning." Briefings in Bioinformatics 22: 1515 - 1530.

[15] Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." J. Mach. Learn. Res. 12: 2825-2830.

[16] Zhou, J., et al. (2020). "Graph Neural Networks: A Review of Methods and Applications." ArXiv abs/1812.08434.

[17] Kipf, T. and M. Welling (2017). "Semi-Supervised Classification with Graph Convolutional Networks." ArXiv abs/1609.02907.

[18] Velickovic, P., et al. (2018). "Graph Attention Networks." ArXiv abs/1710.10903.

[19] Ravindra, N. G., et al. (2020). "Disease state prediction from single-cell data using graph attention networks." Proceedings of the ACM Conference on Health, Inference, and Learning.

[20] Ying, R., et al. (2019). "GNNExplainer: Generating Explanations for Graph Neural Networks." Advances in neural information processing systems 32: 9240-9251.