Team:Tongji Software/Model

  • Summary
  • Data Collection
  • Three scores
  • Following Analysis
  • Extra Analysis
  • Details


Our core assumption when predicting relationships between phages and bacteria is that a bacteriophage and its host bacterium must be correlated in genome sequence.

Fig. Model

Therefore, as shown in the figure, our analysis process can be divided into 4 parts:

Ⅰ Collect genomic data of bacteria and phages;

Ⅱ Through alignment-based and alignment-free methods, three scores are obtained;

Ⅱ-1 Alignment-based Method: Extract spacers from the bacterial collection, then use BLAST to align the spacers to the phage genomes to obtain Scorespacer_Blast;

Ⅱ-2 Alignment-free Method: For the bacterial and phage genome collections, calculate k-mer frequencies and analyze the distribution of their differences to obtain ScoreKmer_freq. In addition, build a Markov model for each potential host genome and score each phage genome against these models to obtain ScoreMarkov.

Ⅲ Normalize the three scores and perform a TOPSIS analysis based on the entropy weight method to obtain the final Score and ranking;

Ⅳ For the Superbugs dataset, take the candidate superbug-phage pairs obtained through the above process, remove the prophages, and cluster the remaining phages.

Data Collection

The data are divided into a database and a benchmark dataset. The detailed collection process is shown in the PDF document below.

Three scores reveal genome sequence correlation

Alignment-based method

Through the CRISPR system, bacteria insert a short fragment of an infecting phage's genome, typically 25–75 base pairs (bp) long, into a CRISPR array as a spacer. Spacers are therefore thought to record past bacteria-phage interactions.

As a result, identifying phage-bacteria links by CRISPR spacer alignment is likely to be the most suitable method for detecting recent phage-host interactions, such as within a metagenomic sample where both bacteria and virus components have been sequenced.

Alignment-based method

Based on this biological background, we collected a set of spacers from the CRISPRCas++ database and with the CRISPRDetect tool, then used BLAST to align the spacers to the phage genomes. The alignments are used to predict links between bacterial and phage genomes and to score them, giving Scorespacer_Blast.
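As a rough illustration of how spacer hits could be turned into a score, the sketch below parses BLAST tabular output (`-outfmt 6`) and keeps the best bit score per host-phage pair. The spacer ID convention (`host|spacer_n`), the identity and mismatch thresholds, and the use of the best bit score as the score are all assumptions made for illustration; they are not taken from the project's actual scoring.

```python
from collections import defaultdict

def spacer_scores(blast_lines, min_identity=95.0, max_mismatch=2):
    """Return {(host, phage): best bit score} from BLAST -outfmt 6 lines.

    Assumes the 12 default tabular columns and spacer query IDs of the
    hypothetical form 'host|spacer_n'.
    """
    best = defaultdict(float)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])      # percent identity
        mismatch = int(fields[4])      # number of mismatches
        bitscore = float(fields[11])   # bit score
        # Keep only near-exact spacer matches (thresholds are assumptions).
        if pident < min_identity or mismatch > max_mismatch:
            continue
        host = qseqid.split("|")[0]
        key = (host, sseqid)
        best[key] = max(best[key], bitscore)
    return dict(best)

# Toy hits: two good matches for hostA, one weak match for hostB.
hits = [
    "hostA|spacer_1\tphage1\t100.0\t32\t0\t0\t1\t32\t500\t531\t1e-10\t60.2",
    "hostA|spacer_2\tphage1\t96.9\t32\t1\t0\t1\t32\t900\t931\t1e-8\t54.7",
    "hostB|spacer_1\tphage1\t87.5\t32\t4\t0\t1\t32\t100\t131\t1e-3\t38.0",
]
scores = spacer_scores(hits)
```

In this toy input, only the hostA-phage1 pair passes the filter; the hostB hit is dropped because its identity is below the assumed threshold.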

However, the spacer-alignment-based method covers only a limited share of phage-host pairs, because only 40%–70% of prokaryotes encode CRISPR systems. We therefore also considered alignment-free methods.

Alignment-free method

For the alignment-free method, we started with k-mers.

k-mers are short subsequences of a specified length k extracted from a genome sequence. Phages have been suggested to ameliorate their genomic composition toward that of the hosts they infect; one possible mechanism is evolutionary pressure to avoid recognition by host restriction enzymes. Thus, k-mer composition may be similar between phages and their hosts, providing a signal for computational prediction of phage–host relationships.
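To make the k-mer idea concrete, here is a minimal sketch that computes k-mer frequency vectors and compares a phage with two candidate hosts. The Manhattan distance, the value of k, and the toy sequences are assumptions chosen for illustration; this is a plain composition distance, not the scoring model used in the project.

```python
from collections import Counter
from itertools import product

def kmer_freq(seq, k=4):
    """Return the frequency of every possible A/C/G/T k-mer, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]

def manhattan(p, q):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Toy sequences (hypothetical): host_a shares the phage's composition.
phage = kmer_freq("ATGCATGCATGC", k=2)
host_a = kmer_freq("ATGCATGCATGCATGC", k=2)
host_b = kmer_freq("GGGGCCCCGGGGCCCC", k=2)

dist_a = manhattan(phage, host_a)
dist_b = manhattan(phage, host_b)
```

Here `dist_a` is smaller than `dist_b`, so under this simple measure host_a would be the more plausible host.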

Fig. Alignment-free method

Based on k-mer composition, and after reviewing the literature, we adopted two models as supplementary indicators of the strength of phage-bacteria interaction: the Prokaryotic virus Host Predictor (PHP) model and the WIsH model. These give ScoreKmer_freq and ScoreMarkov, respectively.
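The Markov-model idea can be sketched as follows: train a fixed-order Markov chain on a candidate host genome, then compute the average log-likelihood of the phage sequence under that model. The order, the pseudocount, the uniform fallback for unseen contexts, and the toy sequences are assumptions for illustration; this is a simplified sketch of the general approach, not the WIsH implementation.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(genome, order=2, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) from one genome."""
    counts = defaultdict(lambda: defaultdict(float))
    genome = genome.upper()
    for i in range(order, len(genome)):
        ctx, nxt = genome[i - order:i], genome[i]
        if nxt in ALPHABET and all(c in ALPHABET for c in ctx):
            counts[ctx][nxt] += 1
    model = {}
    for ctx, nxt_counts in counts.items():
        total = sum(nxt_counts.values()) + pseudocount * len(ALPHABET)
        model[ctx] = {
            b: math.log((nxt_counts.get(b, 0.0) + pseudocount) / total)
            for b in ALPHABET
        }
    return model

def avg_log_likelihood(model, seq, order=2):
    """Average per-base log-likelihood of seq; `order` must match training."""
    seq = seq.upper()
    uniform = math.log(1.0 / len(ALPHABET))  # fallback for unseen contexts
    total, n = 0.0, 0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        if nxt not in ALPHABET:
            continue
        total += model.get(ctx, {}).get(nxt, uniform)
        n += 1
    return total / n if n else float("-inf")

# Toy genomes (hypothetical): the phage resembles host_a, not host_b.
model_a = train_markov("ATGC" * 50)
model_b = train_markov("GGCC" * 50)
phage = "ATGC" * 4
score_a = avg_log_likelihood(model_a, phage)
score_b = avg_log_likelihood(model_b, phage)
```

A higher (less negative) score means the phage sequence is better explained by that host's model, so host_a scores higher here.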

Following analysis of the three scores

After obtaining the three scores above, we normalized them and performed a TOPSIS analysis based on the entropy weight method to obtain the final Score.
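The combination step can be sketched as below, assuming min-max normalization and that all three scores are "higher is better" criteria; neither choice is confirmed by the text. Entropy weights give more weight to scores that vary more across candidate pairs, and TOPSIS ranks each pair by its relative closeness to the ideal solution. The input values are made up.

```python
import math

def min_max(col):
    """Scale a column to [0, 1]; a constant column becomes all zeros."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in col]

def entropy_weights(cols):
    """Entropy weight method on normalized, non-negative columns."""
    n = len(cols[0])
    if n < 2:
        raise ValueError("need at least two candidates")
    divergences = []
    for col in cols:
        s = sum(col)
        ps = [x / s for x in col] if s > 0 else [1.0 / n] * n
        entropy = -sum(p * math.log(p) for p in ps if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    total = sum(divergences)
    if total <= 0:
        return [1.0 / len(cols)] * len(cols)  # no information: equal weights
    return [d / total for d in divergences]

def topsis(rows):
    """Return the relative closeness (0..1) of each row to the ideal solution."""
    cols = [min_max(list(c)) for c in zip(*rows)]
    weights = entropy_weights(cols)
    weighted = [[w * x for x in col] for w, col in zip(weights, cols)]
    best = [max(col) for col in weighted]
    worst = [min(col) for col in weighted]
    scores = []
    for i in range(len(rows)):
        d_pos = math.sqrt(sum((c[i] - b) ** 2 for c, b in zip(weighted, best)))
        d_neg = math.sqrt(sum((c[i] - w) ** 2 for c, w in zip(weighted, worst)))
        scores.append(d_neg / (d_pos + d_neg) if d_pos + d_neg > 0 else 0.0)
    return scores

# Rows: candidate pairs; columns: spacer, k-mer and Markov scores (made up).
pairs = [[60.2, 0.9, -1.30], [0.0, 0.5, -1.38], [54.7, 0.1, -1.45]]
final = topsis(pairs)
ranking = sorted(range(len(pairs)), key=lambda i: final[i], reverse=True)
```

In this toy input the first pair is best on all three scores, so it coincides with the ideal solution and ranks first.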

Fig. Following analysis

Extra processing of the superbug-phage combination dataset

Removal of lysogenic phages (prophages)

A lysogenic phage, also known as a temperate phage, integrates its genes into the host bacterium's chromosome and does not produce progeny phages; instead, its DNA replicates with the bacterial DNA and is passed down as the bacteria divide. The integrated form is called a prophage. In this project, the presence of prophages made the model predict higher values for prophage-superbug interactions. In addition, prophages are not suitable for phage therapy because they do not lyse bacteria, which would reduce the therapeutic effect.

Clustering remaining phages

In practice, for severe bacterial infections, medical professionals use several specific bacteriophages together to attack and kill the bacteria. This use of multiple phage preparations is called phage cocktail therapy, and it is also the most common form of phage therapy. We took this into consideration in this project. According to the literature, the phages in a cocktail should satisfy:

① Each phage can lyse the bacteria on its own;

② The genomic distance among the phages should be as large as possible.

Therefore, we clustered the remaining phage collections according to their genome similarity to obtain different phage clusters. Then, users can choose phages from different clusters to construct a “specific phage cocktail”.
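One simple way to picture this clustering step is sketched below. It groups phages whose k-mer sets are similar, using Jaccard distance and single-linkage merging with a fixed threshold. The distance measure, the threshold, the value of k, and the toy genomes are all assumptions for illustration; the clustering method actually used in the project is not described here.

```python
def kmer_set(seq, k=4):
    """Set of all k-mers occurring in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets have distance 0."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def cluster_phages(genomes, k=4, max_distance=0.5):
    """Single-linkage clustering: link any two phages closer than max_distance."""
    names = list(genomes)
    sets = {name: kmer_set(genomes[name], k) for name in names}
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard_distance(sets[a], sets[b]) <= max_distance:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())

# Toy genomes (hypothetical): phage1 and phage2 are near-identical.
genomes = {
    "phage1": "ATGCATGCATGCATGC",
    "phage2": "ATGCATGCATGCATGG",
    "phage3": "GGGGCCCCGGGGCCCC",
}
clusters = cluster_phages(genomes)
```

With these toy genomes, phage1 and phage2 fall into one cluster and phage3 into another, so a cocktail would take at most one of phage1 or phage2 together with phage3.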


The details are as follows: