Team:AFCM-Egypt/Software

Software

An Approach to Deep Learning-Assisted Zero-N Directed Evolution of Proteins

Summary:

Engineering enhanced versions of proteins requires complex processes that integrate various measurements. This process is often overwhelmed by challenges that are specific for each engineering case. Herein, we have developed a computational approach that integrates predictions of mutational effects on evolutionary fitness with encodings for local and global features representations of protein sequences. These features were used to design a machine learning framework in an approach to accurately predict fit variants without necessitating experimental inputs.

Figure 1. Shows our pipeline for predicting mutational effects on evolutionary fitness and representing mutants for an unsupervised learning approach to predict fit variants.

Index:

  • Prediction of Independent and Epistatic Effects of mutations.
  • Unsupervised Fitness Prediction for Zero-N Directed Evolution.
  • Adaptation of our protein engineering method to antibodies.
  • Adaptation of our protein engineering method to RNA-Binding Proteins.

Availability: Notebooks demonstrating the functions and a complete pipeline to run our software on custom proteins is now available Github

1- Prediction of Independent and Epistatic Effects of mutations

Homologous Sequences Retrieval and Evolutionary Alignments:

Methods:

In our approach to fitness prediction of single and combinatorial mutant libraries, We have based our work on multiple sequence alignments (MSA) of query proteins using different databases including; PDB (Sussman et al., 1998), RefSeq-Protein (O'Leary et al., 2016) and tools including; blastp (Johnson et al., 2008), hhblits (Remmert et al., 2012), ClustalW (Thompson et al., 2003), PSI-BLAST [Altschul et al., 1997] and inspired by Alphafold2 we used MMseqs2 (Steinegger et al., 2017). These alignments can be automatically generated via our functions or provided by the user for further data generation.

Results:

We allow different filtration levels of sequences retrieval and alignment process, dis-allowing; redundant sequences (> 99% of similarity), non-significant e-value sequences, too gapped sequences (>10%) and small coverage sequences (<75%). Subsequently, a consensus sequence is constructed from the alignment at frequency cut-off of 80% which will be used later for further computations.

Prediction of Independent (Positional) Fitness Effects of Mutations:

Methods:

For prediction of mutational effect based on positional change, we have applied a function that computes independent effect of changing amino acid X to Y at position i by computing the frequency of Y at position i within the sequence alignment from a position specific scoring matrix (PSSM) inferred from a consensus sequence, then subtracting the logarithm of observed frequency of X to Y mutation within all positions within the alignment. Finally, multiplying this by the conservation of position i as a function of all residue frequencies divided by the number of the alignment sequences bearing position i (Consi).

Eq. 1

Results:

Figure 2. shows that Δrho AFCM were positive in 75/106 runs versus GEMME (Laine et al., 2019) and 71/106 vs EVmutation (Hopf et al., 2017).

Prediction of Epistatic (Global) Fitness Effects of mutations:

Estimation of Evolutionary Couplings:

To accurately estimate epistatic global effect of mutations whether in single position deep mutational scanning and/or within combinatorial libraries of mutations, we should be able to have a quantifiable measurement of the contacts between the amino acid X at position Y with all other residues occupying other positions in sequence L. We have relied on our approach for coupling matrix generation on pseudo-likelihoods computed by a random field Markov model using CCMPred algorithm [Seemayer et al., 2014] used previously by iGEM team GO_Paris-Saclay_2020. Subsequently, these coupling matrices and features represent coupling for each amino acid in each position with all other positions occupied by all amino acid possibilities including gaps. So, we have used this data both for estimating epistatic effects of mutations and for encoding the local evolutionary data of studied proteins. These coupling data have been used for these purposes as it has been shown their potential in estimating protein consevative features and structures as well.

Method A: Prediction of Epistatic Effect using evolutionary distances:

In this approach we postulated that the epistatic effect of a mutation Y at position i could be determined as a function of the minimum fitness change within all other residues in sequence of length L required to observe such a variant in a similar fashion to the work done by (Laine et al., 2019). This can be determined by measuring how far in the evolutionary alignment we should go to observe mutation Y at that position i. This has been computed via Biopython package phylogenetic distance calculator between sequences bearing Y and the query sequence Q of the user. Then, the minimum value is considered as the minimum evolutionary fitness change. This computation can be further refined by multiplying this distance with the magnitude M of i couplings with other residues in sequence L.

Eq. 2

Method B: Prediction of Epistatic Effects using Residue Frequencies:

However, the pairwise fitness effect of mutation X to Y at i on other residues occupying rest of positions in a sequence of length L could be a representation of the probability of observing evolutionary sequences bearing residue Y at position i versus the probability of observing mutants with X at i. This definition could be also improved if we considered the magnitude of contact effect between position i and rest of positions in sequence L. This magnitude M could be calculated if we have a representation vector for the pairwise coupling potentials of i with other residues in sequence L.

Eq. 3

Method C: Prediction of Epistatic Effects using Residue Couplings:

This method makes use of the evolutionary couplings computed using CCMPred. In which the total energy function of a sequence equals the sum of all singleton and pairwise couplings of that sequence (Equation 4). So to estimate the epistatic effects upon introduction of single or combinatorial mutations, This can be computed by subtracting the energy function of the query sequence So from the energy function of the mutant sequence S (Equation 5). For dimensionality reduction purposes and particularly for large proteins -long sequence- one of the approaches applied is to shorten the long vector to represent only positions where I is mutated in the combinatorial library. Assume 4 positions are allowed to mutate, the length of the vector will be (4 * L) + Ls where Ls represents the singletons couplings. However, the aforementioned approach was proven not to be as effective as total representations, however it can be helpful for time-efficient screening of mutants.

Eq. 4
Eq. 5
Figure 3. shows that Δ rho AFCM were positive in 69/106 runs versus GEMME (Laine et al., 2019) and 86/106 vs EVmutation (Hopf et al., 2017).
Figure 4. shows that experimenting the three methods showed that method C is most sensitive to representations of evolutionary fitness amongst the proposed methods.
Figure 5. shows that Δrho AFCM were positive in 91/106 runs versus GEMME (Laine et al., 2019) and 79/106 vs DeepSequence (Riesselman et al., 2017).

Discussion:

The aforementioned functions not only can help us to estimate fitness of mutants and optimize navigation through greedy search of mutational landscapes, but they can also help us to estimate other evolutionary relevant fitness-related characteristics such as escape mutations for viruses, which are essentially mutations of acceptable viability but with marked fitness effect.

In addition, mutations of positive fitness can be evolutionary prohibited if they lead to compromised viability below an acceptable level, so these mutations might not be observed in nature for sake of a higher order biological system integrity, however, using computational approaches and for artificial systems of protein engineering and cell-free protein synthesis, these mutations can be captured for their high impact on fitness without fear to compromise the biological system, an advantage that might not be available through global evolutionary dynamics. This approach should be further supported by more and more validations for mutational effects and fitness predictions using more testing datasets. We would like also to compare the time complexity of our software for mutational effect prediction versus other algorithms.

Assessment of Physicochemical Parameters for protein ecodings

Methods:

Physicochemical parameters are very important for proper representation of proteins for machine learning tasks, this is due to the relationship between these parameters with functional and structural activity of the protein of interest. The AAindex database incorporates hundreds of physicochemical indices representing amino acids according to the measured index retrieved from experimental literature. However, these hundreds of indices would be dimensionally unfeasible to incorporate within one encoding vector. Thus, a dimensionality reduction approach has been applied by (Georgiev, 2009). through principal component analysis (PCA) of more than 500 indices of AAindex database resulting in 19 principal components representing these features.

We have worked to create functions that can encode proteins and/or compute their component values to be used for representation of studied proteins and inference of important physicochemical features for mutant sequences.

Availability: All the matrices and functions are available through our software notebooks.

Prediction of Local Evolutionary contexts for mutant sequences

Methods:

Local evolutionary features are essential indicators for activity, selectivity and function of homologous sequences of protein families. Herein, we have applied a function to encode protein sequence using data inferred from the coupling matrices and raw features of pairwise interaction generated by pseudolikelihood values from Markov random field model implemented through CCMPred. For this approach, each sequence is represented by a long vector resulting from flattening a matrix -for sequence of length L- with dimensions L by L+1, where the single site couplings are also added for each residue S at position i, and each residue is represented by a vector of length L+1. For dimensionality reduction purposes and particularly for large proteins -long sequence- one of the approaches applied is to shorten the long vector to represent only positions where I is mutated in the combinatorial library. Assume 4 positions are allowed to mutate, the length of the vector will be (4 * L) + Ls where Ls represents the singleton couplings .

Eq. 5

Prediction of Global Evolutionary contexts for mutant sequences

Methods

Global evolutionary contexts are essential features when it comes to protein language models such as TAPE (Rao et al., 2019), trRosseta (Yang et al., 2020) and UniRep [Alley et al., 2019]. These are self-supervised models trained on huge datasets of protein sequence and function annotations. Thus, they are able to learn non-trivial features essential for the process of training deep learning models using protein sequences. However, these models lack local evolutionary contexts. Thus, fine-tuning and retraining these models can enhance their performance over selected protein sequences of diverse functions, a similar approach has been tried to improve UniRep resulting in eUniRep model (Biswas et al., 2021) which is selectively fine-tuned for homologues of desired protein sequence.

In our approach, we decided to use the power of these global protein representation models by integrating these features with the local context encoder so that each sequence is represented by essentially three elements including; physicochemical parameters, local evolutionary contexts and global evolutionary contexts. Among these models we found that TAPE transformer and Unirep encoders might provide maximum feature representation advantages for our model. Of course there might be some considerations for the length of the encoding vectors, so dimensionality reduction techniques might be needed to represent these sequences with optimum computational complexity and without compromising essential features. This approach can help generalize our approach for sequences with diverse complexity in terms of sequence lengths.

Discussions and limitations

The aforementioned approaches have been used not only to validate our method but also to better characterize genetic-engineering related protein within iGEM registry and/or in our engineering designs. This aims to characterize their mutational landscapes and select best fit mutants to perform desired functions in a specific context. This is a primary step to crowdsource protein engineering for iGEM teams. This approach should be further supported by more and more validations for mutational effects and fitness predictions using more testing datasets. We would like also to compare the time complexity of our software for mutational effect prediction versus other algorithms.

Availability: Notebooks demonstrating the functions and a complete pipeline to run our software on custom proteins is now available on Github

Loading ...