Proof Of Concept
Immune response profile of the vaccine
In order to simulate the immune response of the VLP-loaded epitopes of our designed vaccine for TNBC, we used C-IMMSIM (https://kraken.iac.rm.cnr.it/C-IMMSIM/) to anticipate the appropriate number of injections considered to have a promising tumour eradication potential based on the prediction of the immune-specific interactions resulted from variable doses of injection to characterize the immune response.
From these results, we can conclude that only one injection dose is indicated to achieve targeted immune response without the need for booster doses, as no significant or remarkable difference in the immune response is observed by additional doses in terms of immunoglobulins isotypes, IgM and IgG titres, Cytotoxic T-cells stimulation, interleukins (IL-2, IL-4, IL-6 and IL-10) and cytokines (IFN-γ response) concentration, activation of antibody secreting plasma B-cells, as well as macrophage, dendritic and epithelial cell populations. However, the only parameter in favour of the triple injection was the tripled level of active T-helper cells which reached approximately 9000 cells/mm3. Accordingly, one injection dose of the vaccine seems to be efficient in terms of immune response profile characterization.
An Approach to Deep Learning-Assisted Zero-N Directed Evolution of Proteins
Summary:
Engineering enhanced versions of proteins requires complex processes that integrate various measurements. This process is often overwhelmed by challenges that are specific for each engineering case. Herein, we have proved that algorithms for mutationbal effect prediction can accurately coreelate with fit variants valdiated by expeirmental datasets for scanning mutatgenisis both for indepednent and epistaitic mutational effects.
Index:
- Prediction of Independent and Epistatic Effects of mutations.
- Unsupervised Fitness Prediction for Zero-N Directed Evolution.
- Adaptation of our protein engineering method to antibodies.
- Adaptation of our protein engineering method to RNA-Binding Proteins.
Availability: Notebooks demonstrating the functions and a complete pipeline to run our software on custom proteins are now available on Github
1- Prediction of Independent and Epistatic Effects of mutations
Homologous Sequences Retrieval and Evolutionary Alignments:
Methods:
In our approach to fitness prediction of single and combinatorial mutant libraries, We have based our work on multiple sequence alignments (MSA) of query proteins using different databases including; PDB (Sussman et al., 1998), RefSeq-Protein (O'Leary et al., 2016) and tools including; blastp (Johnson et al., 2008), hhblits (Remmert et al., 2012), ClustalW (Thompson et al., 2003), PSI-BLAST [Altschul et al., 1997] and inspired by Alphafold2 we used MMseqs2 (Steinegger et al., 2017). These alignments can be automatically generated via our functions or provided by the user for further data generation.
Results:
We allow different filtration levels of sequences retrieval and alignment process, dis-allowing; redundant sequences (> 99% of similarity), non-significant e-value sequences, too gapped sequences (>10%) and small coverage sequences (<75%). Subsequently, a consensus sequence is constructed from the alignment at frequency cut-off of 80% which will be used later for further computations.
Prediction of Independent (Positional) Fitness Effects of Mutations:
Methods:
For prediction of mutational effect based on positional change, we have applied a function that computes independent effect of changing amino acid X to Y at position i by computing the frequency of Y at position i within the sequence alignment from a position specific scoring matrix (PSSM) inferred from a consensus sequence, then subtracting the logarithm of observed frequency of X to Y mutation within all positions within the alignment. Finally, multiplying this by the conservation of position i as a function of all residue frequencies divided by the number of the alignment sequences bearing position i (Consi).
Results:
Prediction of Epistatic (Global) Fitness Effects of mutations:
Estimation of Evolutionary Couplings:
To accurately estimate epistatic global effect of mutations whether in single position deep mutational scanning and/or within combinatorial libraries of mutations, we should be able to have a quantifiable measurement of the contacts between the amino acid X at position Y with all other residues occupying other positions in sequence L. We have relied on our approach for coupling matrix generation on pseudo-likelihoods computed by a random field Markov model using CCMPred algorithm [Seemayer et al., 2014] used previously by iGEM team GO_Paris-Saclay_2020. Subsequently, these coupling matrices and features represent coupling for each amino acid in each position with all other positions occupied by all amino acid possibilities including gaps. So, we have used this data both for estimating epistatic effects of mutations and for encoding the local evolutionary data of studied proteins. These coupling data have been used for these purposes as it has been shown their potential in estimating protein consevative features and structures as well.
Method A: Prediction of Epistatic Effect using evolutionary distances:
In this approach we postulated that the epistatic effect of a mutation Y at position i could be determined as a function of the minimum fitness change within all other residues in sequence of length L required to observe such a variant in a similar fashion to the work done by (Laine et al., 2019). This can be determined by measuring how far in the evolutionary alignment we should go to observe mutation Y at that position i. This has been computed via Biopython package phylogenetic distance calculator between sequences bearing Y and the query sequence Q of the user. Then, the minimum value is considered as the minimum evolutionary fitness change. This computation can be further refined by multiplying this distance with the magnitude M of i couplings with other residues in sequence L.
Method B: Prediction of Epistatic Effects using Residue Frequencies:
However, the pairwise fitness effect of mutation X to Y at i on other residues occupying rest of positions in a sequence of length L could be a representation of the probability of observing evolutionary sequences bearing residue Y at position i versus the probability of observing mutants with X at i. This definition could be also improved if we considered the magnitude of contact effect between position i and rest of positions in sequence L. This magnitude M could be calculated if we have a representation vector for the pairwise coupling potentials of i with other residues in sequence L.
Method C: Prediction of Epistatic Effects using Residue Couplings:
This method makes use of the evolutionary couplings computed using CCMPred. In which the total energy function of a sequence equals the sum of all singleton and pairwise couplings of that sequence (Equation 4). So to estimate the epistatic effects upon introduction of single or combinatorial mutations, This can be computed by subtracting the energy function of the query sequence So from the energy function of the mutant sequence S (Equation 5). For dimensionality reduction purposes and particularly for large proteins -long sequence- one of the approaches applied is to shorten the long vector to represent only positions where I is mutated in the combinatorial library. Assume 4 positions are allowed to mutate, the length of the vector will be (4 * L) + Ls where Ls represents the singletons couplings. However, the aforementioned approach was proven not to be as effective as total representations, however it can be helpful for time-efficient screening of mutants.
Discussion:
The aforementioned functions not only can help us to estimate fitness of mutants and optimize navigation through greedy search of mutational landscapes, but they can also help us to estimate other evolutionary relevant fitness-related characteristics such as escape mutations for viruses, which are essentially mutations of acceptable viability but with marked fitness effect.
In addition, mutations of positive fitness can be evolutionary prohibited if they lead to compromised viability below an acceptable level, so these mutations might not be observed in nature for sake of a higher order biological system integrity, however, using computational approaches and for artificial systems of protein engineering and cell-free protein synthesis, these mutations can be captured for their high impact on fitness without fear to compromise the biological system, an advantage that might not be available through global evolutionary dynamics.
Assessment of Physicochemical Parameters for protein ecodings
Methods:
Physicochemical parameters are very important for proper representation of proteins for machine learning tasks, this is due to the relationship between these parameters with functional and structural activity of the protein of interest. The AAindex database incorporates hundreds of physicochemical indices representing amino acids according to the measured index retrieved from experimental literature. However, these hundreds of indices would be dimensionally unfeasible to incorporate within one encoding vector. Thus, a dimensionality reduction approach has been applied by (Georgiev, 2009). through principal component analysis (PCA) of more than 500 indices of AAindex database resulting in 19 principal components representing these features.
We have worked to create functions that can encode proteins and/or compute their component values to be used for representation of studied proteins and inference of important physicochemical features for mutant sequences.
Availability: All the matrices and functions are available through our software notebooks.
Prediction of Local Evolutionary contexts for mutant sequences
Methods:
Local evolutionary features are essential indicators for activity, selectivity and function of homologous sequences of protein families. Herein, we have applied a function to encode protein sequence using data inferred from the coupling matrices and raw features of pairwise interaction generated by pseudolikelihood values from Markov random field model implemented through CCMPred. For this approach, each sequence is represented by a long vector resulting from flattening a matrix -for sequence of length L- with dimensions L by L+1, where the single site couplings are also added for each residue S at position i, and each residue is represented by a vector of length L+1. For dimensionality reduction purposes and particularly for large proteins -long sequence- one of the approaches applied is to shorten the long vector to represent only positions where I is mutated in the combinatorial library. Assume 4 positions are allowed to mutate, the length of the vector will be (4 * L) + Ls where Ls represents the singleton couplings .
Prediction of Global Evolutionary contexts for mutant sequences
Methods
Global evolutionary contexts are essential features when it comes to protein language models such as TAPE (Rao et al., 2019), trRosseta (Yang et al., 2020) and UniRep [Alley et al., 2019]. These are self-supervised models trained on huge datasets of protein sequence and function annotations. Thus, they are able to learn non-trivial features essential for the process of training deep learning models using protein sequences. However, these models lack local evolutionary contexts. Thus, fine-tuning and retraining these models can enhance their performance over selected protein sequences of diverse functions, a similar approach has been tried to improve UniRep resulting in eUniRep model (Biswas et al., 2021) which is selectively fine-tuned for homologues of desired protein sequence.
In our approach, we decided to use the power of these global protein representation models by integrating these features with the local context encoder so that each sequence is represented by essentially three elements including; physicochemical parameters, local evolutionary contexts and global evolutionary contexts. Among these models we found that TAPE transformer and Unirep encoders might provide maximum feature representation advantages for our model. Of course there might be some considerations for the length of the encoding vectors, so dimensionality reduction techniques might be needed to represent these sequences with optimum computational complexity and without compromising essential features. This approach can help generalize our approach for sequences with diverse complexity in terms of sequence lengths.
Discussion
The aforementioned approaches have been used not only to validate our method but also to better characterize genetic-engineering related protein within iGEM registry and/or in our engineering designs. This aims to characterize their mutational landscapes and select best fit mutants to perform desired functions in a specific context. This is a primary step to crowdsource protein engineering for iGEM teams.
Availability: Notebooks demonstrating the functions and a complete pipeline to run our software on custom proteins are now available on Github
Unsupervised Fitness Prediction for Zero-N Directed Evolution
Introduction:
Aiming to create a zero-N platform for protein engineering that can be accustomed to virtually any protein, we decided to build two different models that can be fit for this task. Both networks should be able to learn from unlabeled data, as most protein fitness engineering tasks are overwhelmed by the necessity of fitness data. So far machine learning models have been successful to reduce the number of required experimental fitness data points for training a satisfactory computational model. However, still supervised methods largely outperform unsupervised counterparts.
Methods:
For this model, deep neural networks have been constructed using 1-dimensional convolutional neural networks (CNN) and bidirectional long short-term memory (BiLSTM) forms of recurrent neural networks (RNN). These models receive the encodings of protein sequences as inputs to 1dCNN layers that perform the feature extraction process helping the subsequent BiLSTMs to identify the semantic sequences within these features. Subsequently, neural network layers are added and processed through regularization and drop outs. Finally, Dense layers are added and their final states are considered into a probability softmax function to predict fitness of mutant sequence. However, to be able to quantify the variational change of a mutant sequence, The outputs of the BiLSTM layers are summed and compared to the query sequence. This technique could help assess different biological variations within a specific sequence library of a protein. For instance, it can be used to assess antigenic variations, changes in secondary structure and escape mutations of viral proteins. This design can also be adapted for prediction of mutations that can increase mutant fitness but are undesirable due to global cellular constraints. The aforementioned approach can be done in a comparative approach for global embeddings of mutant versions and parent sequences.
Due to time constraints and to simulate selective pressure application as a directed function, we decided to prepare a primary network following that design and compare their regressive probabilities as am auxiliary binary classification task in terms of fit (1) or unfit (0) in relation to wild type sequence for both protein datasets of GB1 (protein G domain 1) at cutoff of 1 for binding IgG-Fc and sensor protein PhoQ at cutoff of 3.287447 for phosphatase activity enrichment.
Results
Clustering model for mutant fitness prediction:
Methods
In this approach we applied a k-means clustering algorithm to cluster encoded protein sequences into at least two main clusters including; fit and unfit mutants. This approach is of great importance as most mutant variants provide noise to the training data. This noise is reflected on the low fitness generated by most of the mutants in the library. Thus, a strict clustering approach is needed to reduce the noise by approximating fit and unfit mutants to distinct virtual centers for clustering purposes. The elbow method was implemented to determine the optimum number of clusters. In case more than 2 clusters were provided, cluster integration was then performed to adapt to unbalancing within the training data, as most of the mutants are unfit. The clustering labels were compared with fitness values which were converted to binaries in relation to wild type sequence fitness.
Results
Overall, The Unsupervised neural network showed relatively better performance than K-means clustering, however future approaches combining both strategies could be best for better performance.
Future Perspective:
In the near future, we would like to further test and refine our algorithm through more validation datasets and comparisons with recently developed tools. There is also a potential space for improving our neural network architecture and comparing different architectures. In addition, we would like to analyze and compare time-complexities and computational constraints of long sequences on our approach.