Team:AFCM-Egypt/Directed-Evolution

Software

An Approach to Deep Learning-Assisted Zero-N Directed Evolution of Proteins

Summary:

Engineering enhanced versions of proteins requires complex processes that integrate various measurements, and each engineering case brings its own challenges. Herein, we have developed a computational approach that integrates predictions of mutational effects on evolutionary fitness with encodings of local and global feature representations of protein sequences. These features were used to design a machine learning framework that aims to accurately predict fit variants without requiring experimental inputs.

Figure 1 shows our pipeline for predicting mutational effects on evolutionary fitness and representing mutants for an unsupervised learning approach to predict fit variants.

Index:

  • Prediction of Independent and Epistatic Effects of mutations.
  • Unsupervised Fitness Prediction for Zero-N Directed Evolution.
  • Adaptation of our protein engineering method to antibodies.
  • Adaptation of our protein engineering method to RNA-Binding Proteins.

Availability: Notebooks demonstrating the functions and a complete pipeline to run our software on custom proteins are now available on GitHub.

1- Prediction of Independent and Epistatic Effects of mutations

Homologous Sequences Retrieval and Evolutionary Alignments:

Methods:

In our approach to fitness prediction for single and combinatorial mutant libraries, we have based our work on multiple sequence alignments (MSAs) of query proteins using different databases, including PDB (Sussman et al., 1998) and RefSeq-Protein (O'Leary et al., 2016), and tools including blastp (Johnson et al., 2008), HHblits (Remmert et al., 2012), ClustalW (Thompson et al., 2003), and PSI-BLAST (Altschul et al., 1997); inspired by AlphaFold2, we also used MMseqs2 (Steinegger et al., 2017). These alignments can be generated automatically via our functions or provided by the user for further data generation.

Results:

We allow different filtration levels during sequence retrieval and alignment, discarding redundant sequences (>99% identity), sequences with non-significant e-values, heavily gapped sequences (>10% gaps), and low-coverage sequences (<75% coverage). Subsequently, a consensus sequence is constructed from the alignment at a frequency cut-off of 80% and used for further computations.
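As a minimal illustration of this filtration step, the sketch below assumes an aligned FASTA file named msa.fasta with the query as the first record and Biopython installed; it drops heavily gapped and low-coverage sequences and derives the 80% consensus, while the redundancy and e-value filters are omitted for brevity.

```python
# Minimal sketch of the MSA filtering and consensus step (assumed input:
# an aligned FASTA file "msa.fasta" with the query as the first record).
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment, AlignInfo

MAX_GAP_FRACTION = 0.10   # drop sequences with >10% gaps
MIN_COVERAGE = 0.75       # drop sequences covering <75% of the query

alignment = AlignIO.read("msa.fasta", "fasta")
query_len = len(str(alignment[0].seq).replace("-", ""))

kept = []
for record in alignment:
    seq = str(record.seq)
    gap_fraction = seq.count("-") / len(seq)
    coverage = (len(seq) - seq.count("-")) / query_len
    if gap_fraction <= MAX_GAP_FRACTION and coverage >= MIN_COVERAGE:
        kept.append(record)

filtered = MultipleSeqAlignment(kept)

# Consensus sequence at an 80% frequency cut-off, used in later steps.
summary = AlignInfo.SummaryInfo(filtered)
consensus = summary.dumb_consensus(threshold=0.8, ambiguous="X")
print(consensus)
```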

Prediction of Independent (Positional) Fitness Effects of Mutations:

Methods:

To predict the mutational effect of a positional change, we have applied a function that computes the independent effect of changing amino acid X to Y at position i. It computes the frequency of Y at position i within the sequence alignment from a position-specific scoring matrix (PSSM) inferred from the consensus sequence, then subtracts the logarithm of the observed frequency of X-to-Y mutations across all positions of the alignment. Finally, the result is multiplied by the conservation of position i, computed as a function of all residue frequencies divided by the number of alignment sequences bearing position i (Consi).

Eq. 1
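The sketch below illustrates one possible reading of this scoring scheme; the exact weighting in Eq. 1 may differ, and the overall X-to-Y substitution frequency is approximated here by the background frequency of Y across the alignment. It assumes the filtered Biopython alignment from the previous sketch.

```python
# Hedged sketch of the positional (independent) effect of a mutation to Y at
# position i; an approximation of the scheme described above, not Eq. 1 itself.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY-"

def column_frequencies(alignment):
    """Per-column residue frequencies (with a small pseudocount)."""
    L = alignment.get_alignment_length()
    freqs = np.zeros((L, len(AA)))
    for i in range(L):
        column = alignment[:, i].upper()
        counts = np.array([column.count(a) for a in AA], dtype=float)
        freqs[i] = (counts + 1e-6) / (counts.sum() + 1e-6 * len(AA))
    return freqs

def independent_effect(alignment, freqs, i, y):
    """Score of mutating to Y at position i, scaled by column conservation."""
    f_y_i = freqs[i, AA.index(y)]                 # frequency of Y at position i
    f_y_bg = freqs[:, AA.index(y)].mean()         # background frequency of Y
    entropy = -(freqs[i] * np.log(freqs[i])).sum() / np.log(len(AA))
    n_i = sum(1 for rec in alignment if rec.seq[i] != "-")  # sequences bearing i
    cons_i = (1.0 - entropy) / max(n_i, 1)        # Cons_i as described in the text
    return (np.log(f_y_i) - np.log(f_y_bg)) * cons_i
```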

Results:

Figure 2 shows that Δρ (AFCM) was positive in 75/106 runs versus GEMME (Laine et al., 2019) and in 71/106 runs versus EVmutation (Hopf et al., 2017).

Prediction of Epistatic (Global) Fitness Effects of mutations:

Estimation of Evolutionary Couplings:

To accurately estimate the global epistatic effect of mutations, whether in single-position deep mutational scanning or within combinatorial mutation libraries, we need a quantifiable measure of the contacts between the amino acid X at position i and all residues occupying the other positions in a sequence of length L. Our approach to coupling-matrix generation relies on pseudo-likelihoods computed by a Markov random field model using the CCMpred algorithm (Seemayer et al., 2014), previously used by iGEM team GO_Paris-Saclay_2020. These coupling matrices represent, for each amino acid at each position, its coupling with all other positions occupied by all possible amino acids, including gaps. We have therefore used this data both for estimating epistatic effects of mutations and for encoding the local evolutionary data of the studied proteins. Coupling data were chosen for these purposes because they have been shown to capture conserved protein features as well as structures.
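A minimal sketch of this step, assuming CCMpred is installed on the system path and that msa.aln holds one aligned sequence per line; file names are illustrative.

```python
# Hedged sketch: generate a coupling/contact matrix with CCMpred and load it
# for downstream use (exact CLI options may differ between CCMpred builds).
import subprocess
import numpy as np

subprocess.run(["ccmpred", "msa.aln", "couplings.mat"], check=True)

# L x L matrix of coupling strengths between positions of the query sequence.
coupling_matrix = np.loadtxt("couplings.mat")
print(coupling_matrix.shape)
```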

Method A: Prediction of Epistatic Effect using evolutionary distances:

In this approach, we postulated that the epistatic effect of a mutation Y at position i can be determined as a function of the minimum fitness change within all other residues of a sequence of length L required to observe such a variant, in a similar fashion to the work of Laine et al. (2019). This can be determined by measuring how far back in the evolutionary alignment we must go to observe mutation Y at position i. We compute this via the Biopython phylogenetic distance calculator between sequences bearing Y and the user's query sequence Q; the minimum value is taken as the minimum evolutionary fitness change. The computation can be further refined by multiplying this distance by the magnitude M of the couplings of position i with the other residues in the sequence.

Eq. 2
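A hedged sketch of Method A, assuming the filtered alignment from the earlier steps (query first) and, optionally, the CCMpred coupling matrix for the scaling by M.

```python
# Minimal sketch of Method A: smallest phylogenetic distance from the query
# to any homolog carrying residue Y at column i, optionally scaled by M.
import numpy as np
from Bio.Phylo.TreeConstruction import DistanceCalculator

def min_distance_to_variant(alignment, i, y, coupling_matrix=None):
    calculator = DistanceCalculator("identity")
    dm = calculator.get_distance(alignment)
    query_id = alignment[0].id
    distances = [
        dm[query_id, rec.id]
        for rec in alignment[1:]
        if str(rec.seq)[i].upper() == y
    ]
    if not distances:
        return np.inf  # variant never observed in the alignment
    d_min = min(distances)
    if coupling_matrix is not None:
        d_min *= np.abs(coupling_matrix[i]).sum()  # magnitude M of couplings of i
    return d_min
```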

Method B: Prediction of Epistatic Effects using Residue Frequencies:

Alternatively, the pairwise fitness effect of mutating X to Y at position i on the residues occupying the remaining positions of a sequence of length L can be represented by the probability of observing evolutionary sequences bearing residue Y at position i versus the probability of observing sequences with X at i. This definition can be further improved by considering the magnitude of the contact effect between position i and the rest of the positions in the sequence. This magnitude M can be calculated given a representation vector for the pairwise coupling potentials of i with the other residues in the sequence.

Eq. 3
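A minimal sketch of Method B under the same assumptions, where freqs is the per-column frequency matrix and coupling_matrix the CCMpred output from the earlier sketches.

```python
# Hedged sketch of Method B: log-odds of observing Y versus X at position i,
# scaled by the coupling magnitude M of position i.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY-"

def epistatic_effect_frequencies(freqs, coupling_matrix, i, x, y):
    p_y = freqs[i, AA.index(y)]
    p_x = freqs[i, AA.index(x)]
    magnitude = np.abs(coupling_matrix[i]).sum()   # M for position i
    return np.log(p_y / p_x) * magnitude
```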

Method C: Prediction of Epistatic Effects using Residue Couplings:

This method makes use of the evolutionary couplings computed with CCMpred, in which the total energy function of a sequence equals the sum of all singleton and pairwise couplings of that sequence (Equation 4). To estimate the epistatic effect of introducing single or combinatorial mutations, we subtract the energy function of the query sequence S0 from the energy function of the mutant sequence S (Equation 5). For dimensionality reduction, particularly for large proteins with long sequences, one approach is to shorten the long vector to represent only the positions mutated in the combinatorial library: if 4 positions are allowed to mutate, the length of the vector is (4 * L) + Ls, where Ls represents the singleton couplings. This reduced representation proved not to be as effective as the full representation, but it can be helpful for time-efficient screening of mutants.

Eq. 4
Eq. 5
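A hedged sketch of Method C, assuming the raw Potts parameters have been parsed from CCMpred's raw output into a single-site array h (L x 21) and a pairwise tensor J (L x L x 21 x 21); the parsing itself is not shown.

```python
# Minimal sketch of Method C: total Potts energy (Eq. 4) and its difference
# between mutant and query sequences (Eq. 5).
import numpy as np

def potts_energy(seq, h, J, alphabet="ACDEFGHIKLMNPQRSTVWY-"):
    idx = [alphabet.index(a) for a in seq]
    L = len(idx)
    energy = sum(h[i, idx[i]] for i in range(L))      # singleton couplings
    for i in range(L):
        for j in range(i + 1, L):
            energy += J[i, j, idx[i], idx[j]]         # pairwise couplings
    return energy

def epistatic_effect_couplings(mutant_seq, query_seq, h, J):
    # Energy difference between the mutant sequence S and the query S0.
    return potts_energy(mutant_seq, h, J) - potts_energy(query_seq, h, J)
```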
Figure 3 shows that Δρ (AFCM) was positive in 69/106 runs versus GEMME (Laine et al., 2019) and in 86/106 runs versus EVmutation (Hopf et al., 2017).
Figure 4 shows that testing the three methods indicated that method C is the most sensitive to representations of evolutionary fitness among the proposed methods.
Figure 5 shows that Δρ (AFCM) was positive in 91/106 runs versus GEMME (Laine et al., 2019) and in 79/106 runs versus DeepSequence (Riesselman et al., 2017).

Discussion:

The aforementioned functions not only help us estimate the fitness of mutants and optimize greedy-search navigation through mutational landscapes; they can also help us estimate other evolutionarily relevant, fitness-related characteristics such as viral escape mutations, which are essentially mutations of acceptable viability but with marked fitness effects.

In addition, mutations of positive fitness can be evolutionarily prohibited if they compromise viability below an acceptable level, so they might not be observed in nature for the sake of the integrity of a higher-order biological system. However, using computational approaches, and in artificial systems of protein engineering and cell-free protein synthesis, such mutations can be captured for their high impact on fitness without the risk of compromising the biological system, an advantage that might not be available through global evolutionary dynamics.

Assessment of Physicochemical Parameters for Protein Encodings

Methods:

Physicochemical parameters are very important for proper representation of proteins in machine learning tasks because of their relationship with the functional and structural activity of the protein of interest. The AAindex database incorporates hundreds of physicochemical indices representing amino acids according to measurements retrieved from the experimental literature. However, these hundreds of indices would be dimensionally infeasible to incorporate within one encoding vector. Thus, Georgiev (2009) applied a dimensionality reduction approach through principal component analysis (PCA) of more than 500 AAindex indices, resulting in 19 principal components representing these features.

We have created functions that encode proteins and/or compute their component values, to be used both for representing the studied proteins and for inferring important physicochemical features of mutant sequences.
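A minimal sketch of such an encoder, assuming a hypothetical table georgiev_components.csv holding the 19 Georgiev components per amino acid (the actual matrices ship with our notebooks, see the availability note below).

```python
# Hedged sketch of the physicochemical encoder: concatenate the 19 Georgiev
# components of every residue of a sequence into one vector.
import numpy as np
import pandas as pd

components = pd.read_csv("georgiev_components.csv", index_col=0)  # 20 x 19 table

def encode_physicochemical(seq):
    return np.concatenate([components.loc[a].to_numpy() for a in seq])
```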

Availability: All the matrices and functions are available through our software notebooks.

Prediction of Local Evolutionary contexts for mutant sequences

Methods:

Local evolutionary features are essential indicators of the activity, selectivity, and function of homologous sequences within protein families. Herein, we have applied a function to encode a protein sequence using data inferred from the coupling matrices and the raw pairwise-interaction features generated by pseudolikelihood values from the Markov random field model implemented in CCMpred. In this approach, each sequence is represented by a long vector obtained by flattening a matrix of dimensions L by L+1 (for a sequence of length L), where the single-site couplings are added for each residue at position i, so each residue is represented by a vector of length L+1. For dimensionality reduction, particularly for large proteins with long sequences, one approach is to shorten the long vector to represent only the positions mutated in the combinatorial library: if 4 positions are allowed to mutate, the length of the vector is (4 * L) + Ls, where Ls represents the singleton couplings.

Eq. 5
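A hedged sketch of this encoder, reusing the Potts parameters h and J assumed in the Method C sketch; restricting positions to the mutated sites gives the reduced representation described above.

```python
# Minimal sketch of the local-context encoder: each position contributes its
# L pairwise coupling strengths (given the residue occupying it) plus one
# single-site term, i.e. a vector of length L + 1 per residue.
import numpy as np

def encode_local_context(seq, h, J, positions=None,
                         alphabet="ACDEFGHIKLMNPQRSTVWY-"):
    idx = [alphabet.index(a) for a in seq]
    L = len(idx)
    positions = range(L) if positions is None else positions  # restrict for large proteins
    rows = []
    for i in positions:
        pairwise = np.array([J[i, j, idx[i], idx[j]] for j in range(L)])
        rows.append(np.append(pairwise, h[i, idx[i]]))          # length L + 1
    return np.concatenate(rows)
```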

Prediction of Global Evolutionary contexts for mutant sequences

Methods

Global evolutionary contexts are essential features when it comes to protein language models such as TAPE (Rao et al., 2019), trRosetta (Yang et al., 2020), and UniRep (Alley et al., 2019). These are self-supervised models trained on huge datasets of protein sequences and function annotations; thus, they are able to learn non-trivial features essential for training deep learning models on protein sequences. However, these models lack local evolutionary contexts, so fine-tuning and retraining them can enhance their performance on selected protein sequences of diverse functions. A similar approach was used to improve UniRep, resulting in the eUniRep model (Biswas et al., 2021), which is selectively fine-tuned on homologues of the desired protein sequence.

In our approach, we decided to exploit the power of these global protein representation models by integrating their features with the local-context encoder, so that each sequence is represented by essentially three elements: physicochemical parameters, local evolutionary contexts, and global evolutionary contexts. Among these models, we found that the TAPE transformer and UniRep encoders might provide the greatest feature-representation advantages for our model. There are, of course, considerations regarding the length of the encoding vectors, so dimensionality reduction techniques might be needed to represent these sequences with optimal computational complexity and without compromising essential features. This helps generalize our approach to sequences of diverse lengths.
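A hedged sketch of the global encoder using the TAPE transformer via the tape_proteins package (UniRep can be swapped in analogously); the final concatenation reuses the physicochemical and local encoders sketched earlier.

```python
# Hedged sketch of the global-context encoder and the combined representation.
import numpy as np
import torch
from tape import ProteinBertModel, TAPETokenizer

model = ProteinBertModel.from_pretrained("bert-base")
tokenizer = TAPETokenizer(vocab="iupac")

def encode_global_context(seq):
    token_ids = torch.tensor([tokenizer.encode(seq)])
    with torch.no_grad():
        sequence_output, _ = model(token_ids)
    return sequence_output.mean(dim=1).squeeze().numpy()  # fixed-length embedding

def encode_sequence(seq, h, J):
    # Final representation = physicochemical + local + global features;
    # encode_physicochemical and encode_local_context are the earlier sketches.
    return np.concatenate([
        encode_physicochemical(seq),
        encode_local_context(seq, h, J),
        encode_global_context(seq),
    ])
```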

Discussion

The aforementioned approaches have been used not only to validate our method but also to better characterize proteins related to genetic engineering within the iGEM Registry and/or in our own engineering designs. This aims to characterize their mutational landscapes and select the best-fit mutants to perform desired functions in a specific context, as a primary step toward crowdsourcing protein engineering for iGEM teams.

Availability: Notebooks demonstrating the functions and a complete pipeline to run our software on custom proteins are now available on GitHub.

2- Unsupervised Fitness Prediction for Zero-N Directed Evolution

Introduction:

Aiming to create a zero-N platform for protein engineering that can be adapted to virtually any protein, we decided to build two different models suited to this task. Both networks should be able to learn from unlabeled data, as most protein fitness engineering tasks are burdened by the need for experimental fitness data. So far, machine learning models have succeeded in reducing the number of experimental fitness data points required to train a satisfactory computational model; however, supervised methods still largely outperform their unsupervised counterparts.

Methods:

For this model, deep neural networks have been constructed using 1-dimensional convolutional neural networks (CNNs) and bidirectional long short-term memory (BiLSTM) recurrent neural networks (RNNs). These models receive the encodings of protein sequences as inputs to 1D-CNN layers that perform feature extraction, helping the subsequent BiLSTMs identify the semantic sequences within these features. Subsequently, neural network layers are added and processed with regularization and dropout. Finally, dense layers are added and their final states are passed to a softmax function to predict the fitness probability of a mutant sequence. To quantify the variational change of a mutant sequence, the outputs of the BiLSTM layers are summed and compared to those of the query sequence. This technique can help assess different biological variations within a specific sequence library of a protein; for instance, it can be used to assess antigenic variations, changes in secondary structure, and escape mutations of viral proteins. The design can also be adapted to predict mutations that increase mutant fitness but are undesirable due to global cellular constraints. This can be done comparatively on the global embeddings of mutant and parent sequences.

Due to time constraints, and to simulate the application of selective pressure as a directed function, we prepared a primary network following that design and compared its regression probabilities in an auxiliary binary classification task, labeling variants as fit (1) or unfit (0) relative to the wild-type sequence, for two protein datasets: GB1 (protein G domain 1) at a cutoff of 1 for IgG-Fc binding and the sensor protein PhoQ at a cutoff of 3.287447 for phosphatase activity enrichment.
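A hedged Keras sketch of this architecture; layer sizes and hyperparameters are illustrative, and the two-unit softmax corresponds to the auxiliary fit/unfit classification described above.

```python
# Minimal sketch of the 1D-CNN + BiLSTM fitness network (illustrative sizes).
from tensorflow.keras import layers, models

def build_fitness_network(seq_len, feat_dim):
    inputs = layers.Input(shape=(seq_len, feat_dim))
    x = layers.Conv1D(64, kernel_size=5, activation="relu", padding="same")(inputs)
    x = layers.Conv1D(64, kernel_size=5, activation="relu", padding="same")(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    x = layers.Dense(64, activation="relu", kernel_regularizer="l2")(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(2, activation="softmax")(x)   # fit (1) vs unfit (0)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```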

Results

Figure 6 shows the architecture and performance of our deep learning model on two proteins with valuable datasets for evaluating computational directed evolution algorithms.

Clustering model for mutant fitness prediction:

Methods

In this approach, we applied a k-means clustering algorithm to cluster encoded protein sequences into at least two main clusters: fit and unfit mutants. This approach is important because most mutant variants contribute noise to the training data, reflected in the low fitness of most mutants in the library. Thus, a strict clustering approach is needed to reduce this noise by assigning fit and unfit mutants to distinct virtual centers. The elbow method was used to determine the optimum number of clusters; when more than two clusters were produced, cluster integration was performed to account for the imbalance in the training data, as most mutants are unfit. The clustering labels were compared with fitness values that had been binarized relative to the wild-type sequence fitness.
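A minimal sketch of this clustering step, where X is assumed to be the matrix of encoded mutant sequences produced by the encoders above.

```python
# Hedged sketch: elbow-based choice of k, k-means clustering, and cluster
# integration into a single fit/unfit split.
import numpy as np
from sklearn.cluster import KMeans

def elbow_k(X, k_max=8):
    inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_
                for k in range(1, k_max + 1)]
    curvature = np.diff(inertias, n=2)        # elbow ~ largest bend in the curve
    return int(np.argmax(curvature)) + 2

def cluster_mutants(X):
    """Cluster encoded mutants, then merge clusters into fit (1) / unfit (0)."""
    k = max(2, elbow_k(X))
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    largest = np.bincount(labels).argmax()    # most mutants are unfit
    return (labels != largest).astype(int)
```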

Results

Figure 7 shows the performance of the k-means clustering algorithm on GB1 and PhoQ, indicating a low ability to detect positively fit mutants.

Overall, the unsupervised neural network performed somewhat better than k-means clustering; future approaches combining both strategies could yield better performance.

Future Perspective:

In the near future, we would like to further test and refine our algorithm on more validation datasets and compare it with recently developed tools. We would also like to build upon peer review of our approach for further improvements, to compare the time complexity of our software for mutational effect prediction against other algorithms, and to analyze the computational complexity of our deep learning approach for fitness prediction, including the computational cost for long protein sequences.

3- Adaptation of our protein engineering method to antibodies

Classification Model for Protein Family Classification:

Methods:

Initially, we hypothesized that a classifier model could reliably aid a sequence generator function by classifying sequences into specific functional classes from the protein family database Pfam. Thus, we trained a CNN model to predict protein functional classes for more than 1,000 Pfam classes. Despite achieving 96% accuracy on the test dataset, the classification model could not reflect variations in activity within one class, which is by definition a regression problem. However, such a model can help predict the functional class of the parent sequence, which can direct the final fitness prediction toward a specific functional class and also aids in generalizing the approach across protein families.

Regression Models for Protein-Protein Affinity Prediction:

Methods:

Fitness functions vary with the objective of interest; however, protein activity and target selectivity remain of crucial importance to directed evolution experiments. While both can be framed as a generalizable regression problem, it is helpful to consider target selectivity as an auxiliary function that assists filtration and selection of variants. Herein, we implemented a model that predicts the Kd constant of binding between two proteins; this can act as a sensitive selective-pressure mechanism rather than a general classifier. The model was trained on 418 of the 493 experimentally available protein-protein complexes with affinity data in the Protein Data Bank (PDB) (Dunbar et al., 2014). First, we trained the model to classify the interaction, so that we can assess the possibility of a protein-protein interaction (PPI); then, a regression network predicts the binding affinity as a regression function of the Kd constant. This model enables predictions of binding affinities between proteins, which can further assist the optimization of candidate mutants for a specific target of interest. We applied a universal encoding for each protein within a PPI complex, representing global protein features, local evolutionary features, and physicochemical parameters, and we tested different regression models. The test dataset was a sample of 75 PPI complexes. To test for overfitting, random sampling was applied when computing R2 and mean squared errors. In addition, principal component analysis (PCA) was applied to test for triviality of the training features; however, this resulted in negative R2 scores and high mean squared errors.
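As an illustration of the regression step, the sketch below uses a random forest purely as a stand-in regressor (Figure 8 compares the models actually tested); X is assumed to hold the concatenated encodings of both partners of each complex and y the corresponding affinity values.

```python
# Hedged sketch of training and evaluating a PPI affinity regressor on a
# 75-complex hold-out set, reporting R2 and mean squared error.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

def evaluate_affinity_regressor(X, y, test_size=75, seed=0):
    """X: encoded PPI complexes; y: affinity values (e.g. log-transformed Kd)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    regressor = RandomForestRegressor(n_estimators=500, random_state=seed)
    regressor.fit(X_train, y_train)
    pred = regressor.predict(X_test)
    return r2_score(y_test, pred), mean_squared_error(y_test, pred)
```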

Results

Figure 8 shows assessment metrics of different regression models for PPI affinity prediction.
Figure 9 shows the distribution and sensitivity of the top 3 models on the training dataset (left) and the test dataset (right).

Classification Models for Antibody Development:

Methods:

Antibodies are one of the most important protein classes subjected to artificial design, affinity maturation, and development pipelines. A single model that fully processes libraries of therapeutic antibodies is still lacking; however, integrating different models can boost the performance of antibody-specific directed evolution models. For this purpose, we worked on a set of models that can direct a future generator function for antibody development. The training datasets were adapted from the IEDB and SAbDab databases (Dunbar et al., 2014). First, a paratope prediction classifier was built and tested for accuracy in identifying paratope sequences. Generated sequences can then be passed to the affinity regressor against the target antigen using the PPI affinity model. After obtaining candidate hits, top sequences can be passed to a humanization classifier to retain only sequences with high humanization probability. Finally, it is essential to check the developability of the antibody sequence; for this purpose we built a developability model trained on the same dataset, achieving an overall AUC of 79.22%, which guides the final filtration step by selecting only sequences with high developability probability.
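A hypothetical sketch of how these models can be chained; the candidate representation and the scikit-learn-style predict/predict_proba interfaces are assumptions for illustration, not the exact interfaces of our trained models.

```python
# Hedged sketch of the antibody screening pipeline described above.
import numpy as np

def screen_antibody_candidates(candidates, antigen_features,
                               paratope_clf, affinity_reg,
                               humanization_clf, developability_clf,
                               top_n=50, p_min=0.5):
    """candidates: list of (sequence_id, feature_vector) pairs; models are pre-trained."""
    # 1) keep sequences predicted to carry a paratope
    hits = [(sid, f) for sid, f in candidates if paratope_clf.predict([f])[0] == 1]
    # 2) rank by predicted affinity (Kd) against the target antigen, keep the top hits
    hits.sort(key=lambda c: affinity_reg.predict(
        [np.concatenate([c[1], antigen_features])])[0])
    hits = hits[:top_n]
    # 3) keep only sequences with a high humanization probability
    hits = [c for c in hits if humanization_clf.predict_proba([c[1]])[0][1] >= p_min]
    # 4) final filter on developability probability
    return [c for c in hits
            if developability_clf.predict_proba([c[1]])[0][1] >= p_min]
```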

Results

Figure 10 shows the pipeline for adapting our protein engineering algorithms to antibody engineering (right) and the testing accuracies of the models built by our team against datasets of COVID-19 antibodies (left).

4- Adaptation of our protein engineering method to RNA-Binding Proteins

Introduction:

RNA-binding proteins are a particularly interesting class of proteins due to their ability to form regulatory circuits upon interaction with different RNA molecules. In our engineering circuits, we have relied upon the MS2 and L7Ae proteins, which bind their RNA riboswitches to switch downstream expression to the off state. Thus, we started our pipeline with a network that predicts whether a protein has RNA-binding properties. We then added another neural network to identify the RNA family of a specific riboswitch, which is particularly important for designing new combinations of riboswitches and RNA-binding proteins. However, this model needs to be supplemented with the ability to predict the affinity, or at least the interaction, between these riboswitches and proteins.

To approximate a neural network for this task, we started with the related problem of RNA-protein interaction, focusing on interactions between RNA aptamers and proteins. For that purpose, we developed another neural network to predict the probability of binding between a particular RNA aptamer and a target protein sequence. This model outperformed the model developed by Li et al. (2020), whose aptamer-protein interaction dataset was used for training: the Li et al. method scored 0.871 AUC compared to 0.881 for our model. Although this model still cannot regress on the RNA-protein binding affinity, it provides a step toward a sensitive classifier of RNA-protein interactions for therapeutic development. In addition, applying a design similar to our PPI affinity model could help address this task in the future.

Methods:

This model is based on a character-level recurrent neural network (RNN), a natural language processing (NLP) approach. It accurately assigns riboswitches to 32 RNA family classes, which makes selecting an RNA-binding protein considerably easier than blind approaches. The same architecture was applied to prediction of RNA-binding proteins; this model binary-classified proteins into RNA-binding and non-RNA-binding proteins with a very high area under the curve (AUC) on test datasets, comparable to the model developed by Premkumar et al. (2020) but not high enough to reach 99% AUC.
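A hedged Keras sketch of a character-level riboswitch classifier of this kind; the recurrent layer type and sizes are illustrative rather than our exact architecture.

```python
# Minimal sketch: one-hot encode RNA characters and classify into 32 families.
import numpy as np
from tensorflow.keras import layers, models

NUCLEOTIDES = "ACGU"

def one_hot_rna(seq, max_len):
    """Character-level one-hot encoding of an RNA sequence, padded to max_len."""
    x = np.zeros((max_len, len(NUCLEOTIDES)), dtype=np.float32)
    for i, base in enumerate(seq[:max_len].upper()):
        if base in NUCLEOTIDES:
            x[i, NUCLEOTIDES.index(base)] = 1.0
    return x

def build_riboswitch_classifier(max_len, n_classes=32):
    inputs = layers.Input(shape=(max_len, len(NUCLEOTIDES)))
    x = layers.Bidirectional(layers.GRU(64))(inputs)   # character-level recurrent layer
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```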

In addition, reliable models are needed to characterize the RNA counterparts in these protein-RNA complexes. Thus, a model was built based on an NLP-style 1-dimensional convolutional neural network (CNN). This model binary-classifies proteins into RNA-binding and non-RNA-binding proteins with a very high AUC on test datasets, outperforming the model by Zheng et al. (2018), which achieved an AUC of 0.95.

In addition, since our circuits integrate toehold switches for regulatory engineering purposes, it was essential to detect the activity of a particular toehold switch as a binary on/off function. This model was built on a CNN and outperformed state-of-the-art deep learning models for fine-tuning toehold switches. It used the same training and test datasets as Angenent-Mari et al. (2020), yet its test AUCs exceeded those reported by Angenent-Mari et al. (2020) with different model architectures.

Results:

Figure 11 shows the pipeline of the additional models that adapt our computational platform to RNA-binding proteins (A). These models performed fairly well when tested against benchmark datasets for each task (B). (C) shows the confusion matrix for the riboswitch model, demonstrating accurate distinction between the 32 riboswitch classes on the test set.

Limitations and Future Perspectives:

This approach should be further supported by additional validation of mutational effect and fitness predictions using more test datasets. In addition, different neural network architectures could be implemented and compared to our existing version. As this field is rapidly growing, our results will be compared against recent advances in the field so that a more advanced version of our work can be prepared for peer review. We would also like to compare the time complexity of our software for mutational effect prediction against other algorithms, and to analyze the computational complexity of our deep learning approach for fitness prediction, including the computational cost for long protein sequences.

References:

  • Sussman JL, Lin D, Jiang J, Manning NO, Prilusky J, Ritter O, Abola EE. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallographica Section D: Biological Crystallography. 1998 Nov 1;54(6):1078-84.
  • O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic acids research. 2016 Jan 4;44(D1):D733-45.
  • Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL. NCBI BLAST: a better web interface. Nucleic acids research. 2008 Apr 24;36(suppl_2):W5-9.
  • Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature methods. 2012 Feb;9(2):173-5.
  • Thompson JD, Gibson TJ, Higgins DG. Multiple sequence alignment using ClustalW and ClustalX. Current protocols in bioinformatics. 2003 Jan(1):2-3.
  • Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997 Sep 1;25(17):3389-402.
  • Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology. 2017 Nov;35(11):1026-8.
  • Laine E, Karami Y, Carbone A. GEMME: a simple and fast global epistatic model predicting mutational effects. Molecular biology and evolution. 2019 Nov 1;36(11):2604-19.
  • Hopf TA, Ingraham JB, Poelwijk FJ, Schärfe CP, Springer M, Sander C, Marks DS. Mutation effects predicted from sequence co-variation. Nature biotechnology. 2017 Feb;35(2):128-35.
  • Riesselman AJ, Ingraham JB, Marks DS. Deep generative models of genetic variation capture mutation effects. arXiv preprint arXiv:1712.06527. 2017 Dec 18.
  • Seemayer S, Gruber M, Söding J. CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations. Bioinformatics. 2014 Nov 1;30(21):3128-30.
  • Rao R, Bhattacharya N, Thomas N, Duan Y, Chen X, Canny J, Abbeel P, Song YS. Evaluating protein transfer learning with TAPE. Advances in neural information processing systems. 2019 Dec;32:9689.
  • Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences. 2020 Jan 21;117(3):1496-503.
  • Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. Unified rational protein engineering with sequence-based deep representation learning. Nature methods. 2019 Dec;16(12):1315-22.
  • Biswas S, Khimulya G, Alley EC, Esvelt KM, Church GM. Low-N protein engineering with data-efficient deep learning. Nature Methods. 2021 Apr;18(4):389-96.
  • Georgiev AG. Interpretable numerical descriptors of amino acid space. Journal of Computational Biology. 2009 May 1;16(5):703-23.
  • Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, Shi J, Deane CM. SAbDab: the structural antibody database. Nucleic acids research. 2014 Jan 1;42(D1):D1140-6.
  • Li J, Ma X, Li X, Gu J. PPAI: a web server for predicting protein-aptamer interactions. BMC bioinformatics. 2020 Dec;21(1):1-5.
  • Premkumar KA, Bharanikumar R, Palaniappan A. Riboflow: using deep learning to classify riboswitches with ~99% accuracy. Frontiers in bioengineering and biotechnology. 2020 Jul 14;8:808.
  • Zheng J, Zhang X, Zhao X, Tong X, Hong X, Xie J, Liu S. Deep-RBPPred: Predicting RNA binding proteins in the proteome scale based on deep learning. Scientific reports. 2018 Oct 15;8(1):1-9.
  • Angenent-Mari NM, Garruss AS, Soenksen LR, Church G, Collins JJ. A deep learning approach to programmable RNA switches. Nature communications. 2020 Oct 7;11(1):1-2.
