Our project is inspired by phagotherapy for superbugs.
In recent years, the abuse of antibiotics has led to the emergence of superbugs, and people infected with them often lack effective cures. This has made biological and medical researchers pay more attention to superbugs. Among the available treatment methods, phagotherapy is a promising one because it treats superbugs effectively and precisely. By this means, we can not only address the problem of antibiotic resistance, but also achieve targeted therapy.
After extensive background research, we established the 'golden rule' of our project analysis: if an interaction between a phage and a bacterium exists, it must be reflected in their genome sequences. We then designed a research method based on the analysis of the genetic sequences of bacteria and phages. It mainly includes three aspects: sequence data downloading and cleaning, genetic sequence correlation analysis, and phagotherapy recommendation scoring. In addition, to provide a more reliable phagotherapy scheme, we also performed prophage screening and cluster analysis. All of these results are presented on the Phage-MAP webpage.
Data Downloading and Processing
We downloaded 10819 bacterial genome sequences from NCBI GenBank (2021.5), all of them labeled as "complete genome". To make more effective phage recommendations for drug-resistant bacteria, we also selected 16 typical superbugs through literature review and expert consultation, integrated them into a smaller superbug dataset, and analyzed them separately in our later workflow.
The phage genomes come from the GenBank database. We screened and annotated 14571 phage sequences. According to the sequence type label in the "gbk" files, we selected the complete genome sequence data. We then deduplicated multiple sequences of the same species, and finally used the NCBI Taxonomy database to annotate the species information.
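The deduplication step can be illustrated with a minimal sketch. The text does not say which sequence was kept for each species, so the rule used here (keep the longest complete genome per taxonomy ID) is only an assumption, and the record fields `taxid`, `length`, and `complete` are hypothetical names chosen for illustration.

```python
def dedupe_by_species(records):
    """Keep one complete genome per species.

    `records` is a list of dicts with the hypothetical keys
    'accession', 'taxid', 'length' and 'complete'.
    Assumption: the longest complete genome represents the species.
    """
    best = {}
    for rec in records:
        if not rec["complete"]:
            continue  # only complete genomes are considered
        current = best.get(rec["taxid"])
        if current is None or rec["length"] > current["length"]:
            best[rec["taxid"]] = rec
    return list(best.values())


records = [
    {"accession": "A1", "taxid": 10, "length": 40000, "complete": True},
    {"accession": "A2", "taxid": 10, "length": 42000, "complete": True},
    {"accession": "B1", "taxid": 20, "length": 35000, "complete": False},
    {"accession": "C1", "taxid": 30, "length": 50000, "complete": True},
]
kept = dedupe_by_species(records)
```

Any other tie-breaking rule (for example, preferring reference sequences) could be swapped in without changing the overall structure.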
Three scores reveal genome sequence correlation
After preliminary research, we drew inspiration from the CRISPR system and decided to use a CRISPR spacer alignment-based method. We first used the "CRISPRDetect" tool and the "CRISPRCas++" database to obtain high-probability spacer sequences in bacteria, and then used the "BLAST" tool to align these short spacer sequences to the phage genomes. High-quality BLAST results (those whose e-value reaches a certain threshold) were retained. Finally, for each phage-bacterium pair, we analyzed the number of alignment results and their bit-score values to obtain Scorespacer_Blast.
Next came a genome correlation analysis based on k-mer frequency. Based on our survey of available tools, we chose "PHP" for this part of the analysis. "PHP" works by counting the distribution of 4-mers and then predicting and scoring with a Gaussian mixture probability model. In this way, we can predict the host of a phage and obtain the second score, ScoreKmer_freq.
We also performed a genome correlation analysis based on a Markov model, using the analysis tool "WIsH". WIsH constructs 8th-order Markov models from the k-mer statistics of the bacterial sequences, computes the probability of the corresponding k-mers in the phage sequence under each model, takes the logarithm to obtain scores, and ranks the scores to give the host prediction. By this means, we obtain the third score, ScoreMarkov.
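As a rough illustration of the spacer-based step, the sketch below aggregates BLAST hits per phage-bacterium pair. It assumes BLAST was run with tabular output (`-outfmt 6`, default column order), that spacers are the queries and phage genomes the subjects, and that a hypothetical `spacer_to_host` mapping links each spacer to its bacterium. The e-value cutoff of 1e-5 is a placeholder, not the threshold used in the project, and how the hit count and bit-scores are combined into Scorespacer_Blast is not specified here.

```python
from collections import defaultdict


def spacer_blast_stats(lines, spacer_to_host, evalue_cutoff=1e-5):
    """Count hits and sum bit-scores per (phage, bacterium) pair.

    `lines` are BLAST outfmt-6 rows: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    `evalue_cutoff` is an illustrative value only.
    """
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        spacer_id, phage_id = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > evalue_cutoff:
            continue  # drop low-quality alignments
        entry = stats[(phage_id, spacer_to_host[spacer_id])]
        entry[0] += 1
        entry[1] += bitscore
    return {pair: (hits, total) for pair, (hits, total) in stats.items()}


rows = [
    "sp1\tphageA\t100.0\t30\t0\t0\t1\t30\t100\t129\t1e-10\t56.5",
    "sp2\tphageA\t96.7\t30\t1\t0\t1\t30\t500\t529\t1e-8\t50.0",
    "sp1\tphageB\t85.0\t20\t3\t0\t1\t20\t10\t29\t0.5\t20.1",
]
pair_stats = spacer_blast_stats(rows, {"sp1": "bacX", "sp2": "bacX"})
```

In this toy input, the hit against phageB is discarded because its e-value exceeds the cutoff.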
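The 4-mer counting that underlies this kind of analysis can be sketched as follows. This is not the "PHP" tool itself and does not reproduce its mixture model; it only shows how a 4-mer frequency profile might be computed, with a plain Euclidean distance standing in as a simple way to compare two profiles. The function names are illustrative.

```python
import math
from itertools import product


def kmer_profile(seq, k=4):
    """Relative frequencies of all 4**k DNA k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def profile_distance(p, q):
    """Euclidean distance between two frequency profiles (a stand-in metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

A phage whose profile lies close to that of a bacterium has a similar oligonucleotide composition, which is the signal this family of methods relies on.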
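The idea of scoring a phage under a host Markov model can be sketched in a few lines. This is a simplified illustration, not the WIsH implementation: the add-one pseudocount and the averaging over positions are assumptions made here so that the example is well defined, and the function names are hypothetical.

```python
import math
from collections import defaultdict

VALID = set("ACGT")


def train_markov(genome, order=8):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    genome = genome.upper()
    for i in range(len(genome) - order):
        window = genome[i:i + order + 1]
        if set(window) <= VALID:
            counts[window[:-1]][window[-1]] += 1
    return counts


def mean_log_likelihood(counts, seq, order=8, pseudocount=1.0):
    """Average log-probability of `seq` under the model (higher = better fit)."""
    seq = seq.upper()
    total, n = 0.0, 0
    for i in range(len(seq) - order):
        window = seq[i:i + order + 1]
        if not set(window) <= VALID:
            continue
        context = counts.get(window[:-1], {})
        seen = context.get(window[-1], 0)
        context_total = sum(context.values())
        total += math.log((seen + pseudocount) / (context_total + 4 * pseudocount))
        n += 1
    return total / n if n else float("-inf")
```

To predict a host, one model would be trained per bacterial genome, the phage scored against each, and the bacteria ranked by score in descending order.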
Phage Recommendation Scoring
After obtaining the above three scores, we conducted a statistical analysis of each one and designed a normalization procedure for each according to the characteristics of its data distribution. This gave us three normalized scores: Scorespacer_Blast, ScoreKmer_freq, and ScoreMarkov.
Finally, we used the TOPSIS method with entropy weighting to combine them into the final score, Score, and then recommended potential phagotherapy schemes in descending order of this score.
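For readers unfamiliar with the combination of entropy weighting and TOPSIS, here is a minimal textbook-style sketch. It assumes each row is one phage-bacterium pair, each column is one already normalized, non-negative score where larger is better, and there are at least two rows. It is not the project's actual code, and the handling of degenerate columns is a choice made for this example.

```python
import math


def entropy_weights(matrix):
    """Entropy weighting: columns with more variation get larger weights."""
    n, m = len(matrix), len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 1.0  # an all-zero column carries no information
        if total > 0:
            entropy = -sum((x / total) * math.log(x / total)
                           for x in column if x > 0) / math.log(n)
        divergences.append(max(0.0, 1.0 - entropy))
    s = sum(divergences)
    return [d / s for d in divergences] if s > 0 else [1.0 / m] * m


def topsis(matrix, weights):
    """Closeness of each row to the ideal solution, in [0, 1]."""
    m = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)]
                for row in matrix]
    best = [max(r[j] for r in weighted) for j in range(m)]
    worst = [min(r[j] for r in weighted) for j in range(m)]
    scores = []
    for r in weighted:
        d_best = math.sqrt(sum((a - b) ** 2 for a, b in zip(r, best)))
        d_worst = math.sqrt(sum((a - b) ** 2 for a, b in zip(r, worst)))
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom else 0.0)
    return scores


# Columns: spacer-BLAST, k-mer frequency, Markov (made-up values).
score_matrix = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.3], [0.5, 0.5, 0.5]]
weights = entropy_weights(score_matrix)
final_scores = topsis(score_matrix, weights)
ranking = sorted(range(len(final_scores)), key=lambda i: -final_scores[i])
```

Sorting the pairs by the resulting closeness value in descending order gives the recommendation order.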
Prophage Analysis for Superbugs
Because prophages cannot lyse bacteria, they are not suitable for phagotherapy and would reduce its therapeutic effect. Therefore, to further improve the reliability of the analysis for superbugs, we performed a prophage analysis on the superbug sequences, using the prophage search software "PHASTER".
The main principle of "PHASTER" is briefly as follows: first, the bacterial sequence is compared with phage sequences by BLAST; the high-quality alignment results are then clustered into prophage clusters by the DBSCAN algorithm; finally, the software evaluates the distance between clusters and returns the prophage prediction result.
Based on the scores returned by "PHASTER", we directly removed the phage-bacterium pairs in which the phage received a high prophage score.
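The filtering step might look like the sketch below. The data layout is hypothetical: `recommendations` holds (phage, bacterium, score) tuples and `prophage_scores` maps a (phage, bacterium) pair to the prophage score assigned to it. The cutoff of 90 is only an illustrative assumption, since the text does not state the threshold that was used.

```python
def drop_prophage_pairs(recommendations, prophage_scores, cutoff=90):
    """Remove pairs whose prophage score is above `cutoff`.

    `cutoff` is an assumed value; pairs without a prophage score are kept.
    """
    return [
        (phage, bacterium, score)
        for phage, bacterium, score in recommendations
        if prophage_scores.get((phage, bacterium), 0) <= cutoff
    ]


recommendations = [
    ("phageA", "bacX", 0.92),
    ("phageB", "bacX", 0.85),
    ("phageC", "bacY", 0.80),
]
prophage_scores = {("phageB", "bacX"): 130, ("phageC", "bacY"): 40}
filtered = drop_prophage_pairs(recommendations, prophage_scores)
```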
Phage Clustering Analysis
Based on our literature research, we chose "cd-hit", a sequence clustering tool based on k-mer analysis of sequence similarity, to cluster the phages with relatively high comprehensive scores. The clusters provide species-related features that we use to recommend phage cocktail treatment schemes.
After completing the preliminary investigation, we carefully designed the modules of the Phage-MAP project. After several discussions, we settled on three main parts: Bacteriophage Bay provides the interface for downloading the original data, Phage Finder offers database retrieval and visualization of the results, and Interactive MAP contains the treatment recommendations for superbugs. These sections give users a convenient interactive interface for using the database as a research aid. We also provide a comment and feedback window at the bottom of the page. You can click here to learn about our project or scroll down to the introduction.
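The general idea of greedy, representative-based clustering followed by cocktail selection can be sketched as follows. This is not cd-hit: Jaccard similarity over k-mer sets is used here as a simple stand-in for its identity computation, and the threshold, the value of `k`, and the rule of picking the top-scoring phage from each cluster are all assumptions made for illustration.

```python
def kmer_set(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs, threshold=0.8, k=5):
    """Cluster sequences longest-first against existing representatives.

    `seqs` maps phage ID -> sequence. Returns a list of clusters,
    each a list of IDs with the representative first.
    """
    order = sorted(seqs, key=lambda sid: (-len(seqs[sid]), sid))
    clusters = []  # (representative k-mer set, member IDs)
    for sid in order:
        ks = kmer_set(seqs[sid], k)
        for rep_ks, members in clusters:
            if jaccard(ks, rep_ks) >= threshold:
                members.append(sid)
                break
        else:
            clusters.append((ks, [sid]))
    return [members for _, members in clusters]


def pick_cocktail(clusters, scores):
    """Take the highest-scoring phage from each cluster."""
    return [max(members, key=lambda pid: scores[pid]) for members in clusters]


phages = {
    "p1": "ACGTACGTACGTACGTACGT",
    "p2": "ACGTACGTACGTACGTACG",
    "p3": "TTTTGGGGCCCCAAAATTTT",
}
clusters = greedy_cluster(phages)
cocktail = pick_cocktail(clusters, {"p1": 0.6, "p2": 0.9, "p3": 0.4})
```

Choosing one phage per cluster keeps the cocktail diverse, so that closely related phages are not recommended together.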
Figure: Overview of the Design section
The Bacteriophage Bay section provides an interface for users to download our data, consisting of “bacteria.csv”, “bacteria_phage_score.csv”, “bacteria_spacer.csv”, “bacteria_taxon.csv”, “bug_score_with_name.csv”, “phage.csv”, “phage_bug.csv”, “result.csv”, “score_bug.csv”, “score_with_name.csv” and “super_bug.csv”. These files contain information on the relationships between bacteria and phages. Downloading them to your own computer will benefit downstream analysis.
Figure: Bacteriophage Bay
Figure: E-R diagram of database
It's worth mentioning that the database contains more than half a million records. We used MySQL, a relational database system, to store the website data. Compared with other database systems, MySQL is smaller in size and faster in command execution. It is open source software and provides a free version. Compared with the setup and management of other large database systems, it is less complex and easier to use.
The Phage Finder section provides a search engine so that users can query our database more conveniently. The raw data includes the basic information, txid, species name, and the scoring evaluation results between phages and bacteria. Users can enter the species they are interested in into the search bar to get filtered information. An interactive map of the results is then shown in the frame to help users understand the relationships more easily. We also provide a download interface for the filtered data, which can facilitate users' follow-up work.
Figure: Phage Finder
The last section is the Interactive MAP, which contains the superbug-phage interaction data and their evaluation scores. On this page, users can select the superbugs they are interested in, and the interactive map for these superbugs is then displayed at the bottom of the page. Users can click a button in this map to be redirected to NCBI for more information about the species. We sincerely hope this helps researchers find targeted phages more effectively.
Figure: Interactive MAP
To make a user-friendly toolbox, apart from the core algorithms, we mainly used the following open source tools and frameworks in the development of the website:
Figure: Interactive MAP
In terms of deployment, we used Nginx and Tomcat as the proxy layer for the above services and deployed them on Alibaba Cloud Elastic Compute Service in an Ubuntu environment to support the operation of the website. The cloud elastic compute service provides stable, secure, and highly resilient network services that can be expanded flexibly according to business requirements. It can also be dynamically migrated with high reliability.
During the development process, we also used the following tools: the front-end development and debugging tool Visual Studio Code, the API testing tool Postman, the back-end development and debugging tool IntelliJ IDEA, the database development and debugging tool Navicat, the team collaboration tool Feishu, the version control tool Git, and the code hosting tool GitHub.
During our research, we also reflected on our analysis methods as we went and proposed some plans for further improvement, which can be applied to later updates of our database's analysis functions.
Scoring Analysis Based on Sequence Correlation
When we were investigating methods for our project, we discovered a number of tools that analyze the correlation between bacteria and phages based on their genome sequences. With sufficient time, we could investigate these tools further, then summarize and integrate their analysis models and performance results. In this way, we could adopt more effective models (such as convolutional neural networks), and perhaps develop new bacteria-phage correlation analysis tools.
Comprehensive Scoring Analysis
Multi-score evaluation is a key issue in any evaluation problem. We could further investigate and apply some classic comprehensive evaluation models, such as gray evaluation models. In addition, once more characteristic analysis scores and a more authoritative validation dataset are available, we could perform feature engineering on them using machine learning methods, and then construct and tune a fully-connected neural network to achieve a better comprehensive evaluation result.