Team:Edinburgh/Model

The SuperGrinder





Modeling : Machine Learning in Directed Evolution for Improving the Thermal Stability of Cellulase

 



Background

Biomass conversion is an important strategy for capturing energy efficiently, and it requires the continuous optimization of enzymes for activity, selectivity, stability, or tolerance of non-natural environments. This is currently achieved by directed evolution, through iterative experimental rounds of mutation, selection, and amplification. However, this method still faces challenges in processing the tremendous amounts of data generated by protein mutagenesis libraries. Recently, machine learning approaches have been applied effectively in directed evolution to learn the relationship between protein sequences and their properties. In this work, machine learning models for predicting the thermal stability of cellulase were built on a public dataset, using four machine learning algorithms (ridge regression, support vector regression, random forests, and a convolutional neural network) and three protein descriptors (one-hot encoding vector, Atchley factor vector, and protein embedding vector). Predictive performance was then evaluated using two common regression metrics.

 

Method:

Dataset:

The input dataset used in this work was downloaded from the ProtaBank database (https://www.protabank.org/study_analysis/AJSM4jY7/). Carlin et al. measured Michaelis-Menten kinetic constants and thermal stability for variants of the beta-glucosidase B protein expressed in E. coli. The thermal stability parameter Tm, defined as the temperature at which half of the protein molecules are denatured after heat challenge, was measured using a thermal denaturation assay (Carlin et al., 2017). The dataset was submitted to ProtaBank in 2018 under submission ID AJSM4jY7.

Protein Encoding:

Machine learning models require numeric inputs, so protein variants cannot be fed to them directly. Information about protein sequences, structures, or functional features must be represented as vectors or matrices before a model can be built; such representations are called protein descriptors. The choice of conversion from protein sequence to descriptor strongly affects how well a machine learning model can learn. In this work, three encoding methods were used to build protein descriptors.

 


Figure 1. Approach for Constructing a One-Hot Encoding Vector (OEV). An L-length protein sequence was encoded into an L*20 matrix by one-hot encoding, which was then flattened into a 1-dimensional vector of length L*20.
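The construction in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration of the OEV descriptor, assuming the canonical 20-letter amino acid alphabet; it is not the team's exact implementation.

```python
# Sketch of the one-hot encoding vector (OEV): each residue becomes a
# 20-length indicator vector, and the resulting L x 20 matrix is
# flattened into a single L*20-length vector.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # canonical 20 residues

def one_hot_vector(sequence: str) -> list[int]:
    """Flatten an L x 20 one-hot matrix into an L*20-length vector."""
    vector = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1  # mark this residue's column
        vector.extend(row)
    return vector

# A 3-residue peptide yields a 60-length vector with exactly three 1s.
oev = one_hot_vector("ACD")
print(len(oev))  # 60
```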


Figure 2. Approach for Constructing an Atchley Factor Vector (AFV). An L-length protein sequence was encoded into an L*5 matrix using the five Atchley factors, which was then flattened into a 1-dimensional vector of length L*5.
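The AFV construction follows the same flattening pattern, with a lookup of five numeric factors per residue (Atchley et al., 2005). In this sketch the factor table covers only three residues with approximate values for illustration; a real implementation would use the full published table.

```python
# Sketch of the Atchley factor vector (AFV): each residue maps to five
# numeric factors, and the L x 5 matrix is flattened into an L*5-length
# vector. Values below are approximate and illustrative only -- use the
# full published Atchley et al. (2005) table in practice.

ATCHLEY_FACTORS = {
    "A": (-0.59, -1.30, -0.73, 1.57, -0.15),  # approximate values
    "C": (-1.34, 0.47, -0.86, -1.02, -0.26),  # approximate values
    "D": (1.05, 0.30, -3.66, -0.26, -3.24),   # approximate values
}

def atchley_vector(sequence: str) -> list[float]:
    """Flatten an L x 5 Atchley factor matrix into an L*5-length vector."""
    vector = []
    for residue in sequence:
        vector.extend(ATCHLEY_FACTORS[residue])
    return vector

afv = atchley_vector("ACD")
print(len(afv))  # 15 (= 3 residues x 5 factors)
```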


Figure 3. Approach for Constructing a Protein Embedding Vector (PEV). An L-length protein sequence was split into L-2 overlapping 3-mers, and each 3-mer was encoded into a 64-length vector using a protein embedding method. The average of these 64-length vectors was then used to represent the protein sequence.
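The PEV pipeline of Figure 3 can be sketched as follows. A real pipeline would look each 3-mer up in a pretrained embedding model (e.g. a ProtVec-style table); here a deterministic pseudo-random stand-in replaces that lookup so the averaging step can be shown end to end.

```python
# Sketch of the protein embedding vector (PEV): split the sequence into
# L-2 overlapping 3-mers, embed each 3-mer as a 64-length vector, and
# average the vectors. The stand-in below hashes the 3-mer to seed a
# deterministic random vector; a trained embedding table would replace it.
import hashlib
import random

def kmer_embedding(kmer: str, dim: int = 64) -> list[float]:
    seed = int(hashlib.md5(kmer.encode()).hexdigest(), 16)  # stand-in lookup
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(dim)]

def protein_embedding(sequence: str, dim: int = 64) -> list[float]:
    kmers = [sequence[i:i + 3] for i in range(len(sequence) - 2)]  # L-2 3-mers
    vectors = [kmer_embedding(k, dim) for k in kmers]
    # element-wise mean across all 3-mer vectors
    return [sum(col) / len(vectors) for col in zip(*vectors)]

pev = protein_embedding("ACDEFG")  # 6-residue sequence -> 4 3-mers
print(len(pev))  # 64
```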


Algorithm:

In this work, four machine learning algorithms were used: Ridge Regression (RR), Support Vector Regression (SVR), Random Forests (RF), and a Convolutional Neural Network (CNN). For each algorithm, a set of hyperparameters expected to have a significant impact on model performance was selected for tuning. Grid search was performed to find the optimal value of each hyperparameter, and the final model was then trained on the training set using the optimal hyperparameters.
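The tuning procedure above can be sketched with ridge regression, which has a closed-form solution and a single key hyperparameter (the regularization strength alpha). This is a minimal grid search over a held-out validation split on synthetic data, not the team's actual pipeline; the same loop structure applies to the SVR, RF, and CNN hyperparameters in Table 1.

```python
# Minimal grid-search sketch: for each candidate hyperparameter value,
# fit on the training split, score on a validation split, keep the best.
import numpy as np

def fit_ridge(X, y, alpha):
    # closed-form ridge solution: w = (X^T X + alpha*I)^-1 X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic regression data standing in for (descriptor, Tm) pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
w_true = rng.normal(size=10)
y = X @ w_true + rng.normal(scale=0.1, size=80)

X_train, X_val = X[:60], X[60:]
y_train, y_val = y[:60], y[60:]

best_alpha, best_score = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0]:          # hyperparameter grid
    w = fit_ridge(X_train, y_train, alpha)
    score = mse(y_val, X_val @ w)
    if score < best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, round(best_score, 4))
```

In practice a library routine such as scikit-learn's GridSearchCV (typically with cross-validation rather than a single split) would do this loop.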

 


Figure 4. Architecture of the 1D Convolutional Neural Network Model. CONV, convolution layer; POOL, pooling layer; FC, fully connected layer. The size of each layer's output is shown at the bottom. Key parameters of the convolution and pooling layers are also included; e.g., convolution layer 1 used 32 2*1 filters.
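To make the shape bookkeeping of a CONV-POOL-FC stack concrete, here is a numpy-only forward pass. Only "conv1: 32 filters of width 2" comes from Figure 4; the input length, channel count, and remaining layer sizes are illustrative assumptions, and a real model would be built and trained in a deep learning framework.

```python
# Shape sketch of a 1-D CNN forward pass (CONV -> POOL -> FC), numpy only.
import numpy as np

def conv1d(x, filters):
    # x: (length, channels); filters: (n_filters, width, channels)
    n_filters, width, _ = filters.shape
    out_len = x.shape[0] - width + 1
    out = np.zeros((out_len, n_filters))
    for i in range(out_len):
        window = x[i:i + width]  # (width, channels) slice under each filter
        out[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0)    # ReLU activation

def maxpool1d(x, size=2):
    trimmed = x[: (x.shape[0] // size) * size]
    return trimmed.reshape(-1, size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))        # e.g. an Atchley-encoded sequence
conv1 = rng.normal(size=(32, 2, 5))  # 32 filters of width 2 (Figure 4)

h = maxpool1d(conv1d(x, conv1))      # CONV -> POOL
flat = h.reshape(-1)                 # flatten for the FC layer
w_fc = rng.normal(size=flat.shape[0])
prediction = float(flat @ w_fc)      # FC output: one predicted Tm value

print(h.shape)  # (49, 32): conv gives (99, 32), pooling halves the length
```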


Table 1. Hyperparameters Settings for Tuning Procedure.


Result:

Two performance metrics, mean squared error (MSE) and coefficient of determination (R-squared), were calculated from the testing-set predictions and true values for each of the 12 machine learning models. For models whose results vary with the random seed (RF and CNN), the reported value is the mean MSE or R-squared over 100 runs of the model, with the standard deviation given in parentheses.
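The two metrics in Tables 2-3 follow directly from their definitions, sketched here on illustrative Tm values (not data from this study). MSE is the mean squared residual; R-squared is 1 - SS_res/SS_tot and can be negative when a model predicts worse than the mean of the true values.

```python
# Regression metrics implemented from their definitions.

def mse(y_true, y_pred):
    """Mean squared error: average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [50.2, 48.7, 52.1, 49.5]  # illustrative measured Tm values
y_pred = [50.0, 49.0, 51.8, 49.9]  # illustrative model predictions
print(round(mse(y_true, y_pred), 4), round(r_squared(y_true, y_pred), 4))
```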


Table 2. MSE of Testing Set Predictions from Models Trained with Different Protein Descriptors and Machine Learning Algorithms.


Table 3. R-Squared of Testing Set Predictions from Models Trained with Different Protein Descriptors and Machine Learning Algorithms.


Figure 5. Heatmap of MSE for Different Protein Descriptors and Machine Learning Algorithms. The MSE value is also shown in each block; brighter colors indicate lower values and darker colors indicate higher values.


Figure 6. Heatmap of R-Squared for Different Protein Descriptors and Machine Learning Algorithms. The R-squared value is also shown in each block; red indicates positive values, green indicates negative values, and the shade of the color indicates magnitude.


Among the 12 machine learning predictive models, the convolutional neural network with the Atchley factor vector had the lowest MSE and the highest R-squared, showing that it was the best model for predicting the thermal stability of cellulase on this dataset.

 

Conclusion:

This work demonstrated that combining machine learning methods with directed evolution is an efficient strategy for improving the thermal stability of cellulase. Predictive models built with machine learning can learn sequence-property relationships from directed evolution experiments and help guide enzyme evolution through iterative rounds of prediction and validation. Among the four machine learning algorithms and three protein descriptors tested, the convolutional neural network with the Atchley factor vector performed best at predicting the thermal stability of cellulase.

 

Further Work:

As part of the iGEM 2021 Team Edinburgh project, this work outlines a simple machine learning workflow for constructing predictive models with a recommended set of algorithms and protein descriptors, which can serve as a template for further studies. Such predictive methods can be applied effectively in the selection step of directed evolution.

 

References:

1. Carlin, D. A., Hapig-Ward, S., Chan, B. W., Damrau, N., Riley, M., Caster, R. W., Bethards, B. & Siegel, J. B. (2017). Thermal stability and kinetic constants for 129 variants of a family 1 glycoside hydrolase reveal that enzyme activity and stability can be separately designed. PLoS One, 12, e0176255.

