Modeling: Machine Learning in Directed Evolution for Improving the Thermal Stability of Cellulase
Criteria:
Use modeling to gain insight into how your project works or should be implemented. Explain your model's assumptions, data, parameters, and results in a way that anyone could understand.
Some ways to achieve this include:
1. Deterministic, exploratory, molecular dynamic, and/or stochastic models
2. Explore the physical modeling of a single component within a system
3. Utilize mathematical modeling for predicting function of a more complex device
Notes:
- This could be either a new model you develop or the implementation of a model from a previous team.
- The work should be substantial and show excellence.
Background
Biomass conversion is an important strategy for capturing energy effectively. It requires the continuous optimization of enzymes for activity, selectivity, stability, or tolerance to non-natural environments. This is currently achieved by directed evolution, through iterative experimental rounds of mutation, selection, and amplification. However, the method still faces challenges in processing the large amounts of data generated from protein mutagenesis libraries. Recently, machine learning approaches have been applied effectively in directed evolution to learn the relationship between protein sequences and properties. In this work, machine learning predictive models for the thermal stability of cellulase were built on a public dataset, combining four machine learning algorithms (ridge regression, support vector regression, random forests and a convolutional neural network) with three protein descriptors (one-hot encoding vector, Atchley factor vector and protein embedding vector). Predictive performance was then evaluated using two common regression metrics.
Method:
Dataset:
The input dataset used in this work was downloaded from the ProtaBank database (https://www.protabank.org/study_analysis/AJSM4jY7/). Carlin et al. reported Michaelis-Menten kinetic constants and thermal stability measurements for variants of the beta-glucosidase B protein expressed in E. coli. The thermal stability parameter Tm, defined as the temperature at which half of the protein molecules are denatured after heat challenge, was measured using a thermal denaturation assay (Carlin et al., 2017). The experimental dataset was submitted to ProtaBank in 2018 under the submission ID AJSM4jY7.
Protein Encoding:
Machine learning models require their inputs to be vectors or matrices, so they cannot work on protein variants directly. Information about protein sequence, structure or function must first be represented numerically. These numerical representations are called protein descriptors. The choice of conversion from protein sequence to protein descriptor strongly affects how well a machine learning model can learn. In this work, three protein encoding methods were used to build protein descriptors: one-hot encoding vector, Atchley factor vector and protein embedding vector.
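As an illustration of the simplest of the three descriptors, the sketch below one-hot encodes a protein variant using only the Python standard library. The residue ordering, the helper names (`one_hot_encode`, `apply_mutation`), the mutation notation and the toy sequence are all assumptions made for this example and are not taken from the project code. The Atchley factor and embedding descriptors would replace each 20-element indicator row with a short real-valued vector per residue and are not shown here.

```python
# The 20 standard amino acids in a fixed order (an assumed ordering;
# any consistent ordering works as long as it is reused everywhere).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a flat list of length 20 * len(sequence) with one 1 per residue."""
    vector = []
    for residue in sequence.upper():
        if residue not in AA_INDEX:
            raise ValueError(f"Unknown residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        vector.extend(row)
    return vector


def apply_mutation(wild_type, mutation):
    """Apply a point mutation written like 'E3A' (1-based position)."""
    original, position, replacement = mutation[0], int(mutation[1:-1]), mutation[-1]
    if wild_type[position - 1] != original:
        raise ValueError(f"{mutation}: expected {original} at position {position}")
    return wild_type[:position - 1] + replacement + wild_type[position:]


# Hypothetical toy sequence, NOT the real beta-glucosidase B sequence.
toy_wild_type = "MSENTF"
toy_variant = apply_mutation(toy_wild_type, "E3A")
encoded = one_hot_encode(toy_variant)
print(toy_variant, len(encoded), sum(encoded))
```

Every variant of the same length maps to a vector of the same size, which is what allows the variants to be stacked into a single input matrix for the regression models.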
Algorithm:
In this work, four machine learning algorithms were used: Ridge Regression (RR), Support Vector Regression (SVR), Random Forests (RF) and Convolutional Neural Network (CNN). For each algorithm, a set of hyperparameters was specified for tuning, chosen because they were expected to have a significant impact on model performance. Grid search was performed to find the optimal values of these hyperparameters. Once the optimal hyperparameters were determined, the final model was trained on the training set using them.
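The tuning procedure described above can be sketched for ridge regression as follows. This is a minimal hand-rolled example that assumes NumPy is available. The data are synthetic, the grid of `alpha` values is illustrative, and the use of 5-fold cross-validated MSE as the selection score is an assumption, since the scoring scheme used for the grid search is not stated here. The same loop structure would apply to the other algorithms with their own hyperparameter grids.

```python
import itertools
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return weights, y_mean - x_mean @ weights


def cv_mse(X, y, alpha, n_folds=5):
    """Mean validation MSE of ridge regression over contiguous folds."""
    indices = np.arange(len(y))
    errors = []
    for fold in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, fold)
        weights, intercept = fit_ridge(X[train], y[train], alpha)
        pred = X[fold] @ weights + intercept
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))


def grid_search(X, y, grid):
    """Try every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_mse(X, y, **params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic stand-in data (NOT the ProtaBank measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=80)
X_train, y_train, X_test, y_test = X[:60], y[:60], X[60:], y[60:]

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}  # illustrative values
best_params, best_cv_mse = grid_search(X_train, y_train, param_grid)

# Refit on the whole training set with the selected hyperparameters.
weights, intercept = fit_ridge(X_train, y_train, **best_params)
test_mse = float(np.mean((y_test - (X_test @ weights + intercept)) ** 2))
print(best_params, round(best_cv_mse, 3), round(test_mse, 3))
```

Keeping the test set out of the search and touching it only after the final refit means the reported test error is not biased by the hyperparameter selection.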
Result:
Two performance metrics, mean squared error (MSE) and coefficient of determination (R-squared), were calculated from the testing set predictions and true values for each of the 12 machine learning predictive models (four algorithms combined with three protein descriptors). For models whose results vary with the random seed (RF and CNN), the reported values are the mean MSE or R-squared over 100 runs of the model, with the standard deviation given in parentheses.
Among the 12 machine learning predictive models, the convolutional neural network model with the Atchley factor vector had the lowest MSE and the highest R-squared, indicating that it was the best model for predicting the thermal stability of cellulase on this dataset.
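For reference, the two metrics and the "mean (standard deviation)" summary can be computed as in the sketch below, which uses only the Python standard library. The Tm values and the per-run scores are made up for illustration, the function names are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
import statistics


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def summarize(scores):
    """Format repeated-run scores as 'mean (sample standard deviation)'."""
    return f"{statistics.mean(scores):.3f} ({statistics.stdev(scores):.3f})"


# Made-up Tm values in degrees Celsius, for illustration only.
y_true = [40.0, 42.0, 38.0, 44.0]
y_pred = [41.0, 41.0, 39.0, 43.0]
mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)

# Made-up MSE values standing in for repeated runs with different seeds.
run_summary = summarize([1.0, 2.0, 3.0])
print(mse, r2, run_summary)
```

A lower MSE and an R-squared closer to 1 both indicate better agreement between predicted and measured Tm, which is why the two metrics rank the models consistently in the comparison above.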
Conclusion:
This work demonstrated that combining machine learning methods with directed evolution is an efficient strategy for improving the thermal stability of cellulase. Predictive models built by machine learning can learn relationships between sequences and properties from directed evolution experiments and, through iterative rounds of prediction and validation, help guide the evolution of an enzyme. Among the four machine learning algorithms and three protein descriptors tested, the convolutional neural network model with the Atchley factor vector performed best in predicting the thermal stability of cellulase.
Further Work:
As part of the iGEM 2021 Team Edinburgh project, this work outlines a simple workflow for constructing machine learning predictive models, with recommended algorithms and protein descriptors, which can serve as a template for further studies. Machine learning prediction could be applied effectively in the selection step of directed evolution.
References:
1. CARLIN, D. A., HAPIG-WARD, S., CHAN, B. W., DAMRAU, N., RILEY, M., CASTER, R. W., BETHARDS, B. & SIEGEL, J. B. 2017. Thermal stability and kinetic constants for 129 variants of a family 1 glycoside hydrolase reveal that enzyme activity and stability can be separately designed. PLoS One, 12, e0176255.