We wanted to develop a machine learning model capable of generating novel amino acid sequences. While conducting our research, we discovered that the 2019 Toronto iGEM team had attempted this with a recurrent neural network (RNN) and UniRep, a deep learning model for protein engineering. Their attempt at generating sequences succeeded, but we had a hard time evaluating the output and identifying which sequences were viable for validation.
Beyond reviewing some of the errors in their pipeline, we wished to improve the model's workflow by increasing the number of sequences in the RNN's training dataset. We also made the output more selective by letting the end user choose how many sequences to generate. These pipeline reworks led us to a new avenue of contribution: the development of a Python package. The package reworks the pipeline's use cases and makes it applicable to projects beyond our own.
This package expands the pipeline's functionality and makes it fully customizable. End users can supply their own training data, filters, and stability indices, adapting the pipeline to any enzyme or protein other than PETase.
The package is available here.
We believe a description of our package's contents explains how our project's pipeline works and how it can be implemented, whether by our proposed end users or by anyone aiming to improve the catalytic activity of other enzymes. Our package aims to support research into enzymatic activity, with the hope of optimizing sequences and improving substrate degradation.
The goal of this package is to return an ideal set of mutant candidates for a protein sequence, where each candidate passes filters designed by the end user. The parameters the package applies to the sequence can be set entirely by the end user for the specific protein or enzyme they are working with. Below is a rundown of the package's functions and their relevance to our project.
construct_diwv()
Function that creates a stability index dictionary for amino acid pairs.
This function builds a dictionary of stability indices keyed by amino acid pairs. The idea is to assess a sequence's stability, and thus whether it can plausibly be created by inducing mutations in the wild-type sequence. The matrix of instability values is adapted from Guruprasad, Reddy and Pandit (1990) and is included in the package (labeled "diwv.csv"), but users are also welcome to supply their own set of values to construct their own index dictionary.
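As a rough sketch of the idea, the matrix-style CSV can be parsed into a pair-keyed dictionary as below. The column/row layout and the numeric values in the sample are placeholders for illustration, not the actual contents of diwv.csv:

```python
import csv
import io

# Illustrative 3x3 slice with placeholder values; the real diwv.csv
# covers all 400 amino acid pairs from Guruprasad et al. (1990).
sample_csv = """,G,A,S
G,13.34,1.0,1.0
A,1.0,1.0,1.0
S,1.0,1.0,-7.49
"""

def construct_diwv(csv_file):
    """Build a {(first_aa, second_aa): instability_value} dictionary
    from a matrix-style CSV (rows = first residue, columns = second)."""
    reader = csv.reader(csv_file)
    columns = next(reader)[1:]  # amino acids labelling the columns
    diwv = {}
    for row in reader:
        first = row[0]
        for second, value in zip(columns, row[1:]):
            diwv[(first, second)] = float(value)
    return diwv

diwv = construct_diwv(io.StringIO(sample_csv))
```

Keying the dictionary on residue pairs makes later lookups over consecutive residues in a sequence a single dictionary access.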
passes_filters()
Function that defines the filters the generated sequences will be run through.
This function contains the source code for any filters on enzyme or protein properties, such as sequence stability, length, catalytic activity, or thermostability. Filters for sequence stability (based on the dictionary created by construct_diwv()) and sequence length are provided in the source code, but end users are welcome to modify them or add their own filters for the enzymatic properties they are researching. This function only needs to be edited if you choose to modify the included filters and/or add your own.
run_rnn()
Function that builds and runs the machine learning model that generates your sequences.
This is the main function: it takes in the training dataset, learns from it, and generates the best possible mutant candidates after sorting them through the filters established in passes_filters(). construct_diwv() must be run before this function, so that run_rnn() has an index to work from when applying the filters in passes_filters().
This function is the most adaptable part of the workflow: essentially all of its parameters can be changed to suit the end user's needs, whether the sequence length to analyze, the number of training epochs, the number of hidden units in the recurrent neural network, or the number of sequences to generate. The package includes the parameter values our team used to generate our PETase mutant candidates.
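The generate-then-filter loop at the heart of run_rnn() can be sketched as follows. This is a hypothetical skeleton, not the package's implementation: the trained RNN is abstracted as a sampling function, and a seeded random sampler and a toy filter stand in for the real model and filters:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def run_rnn(n_sequences, seq_len, diwv, passes_filters, sample_fn,
            max_attempts=10000):
    """Draw candidates from sample_fn (standing in for the trained RNN)
    and keep the first n_sequences that pass the filters."""
    accepted = []
    for _ in range(max_attempts):
        candidate = sample_fn(seq_len)
        if passes_filters(candidate, diwv):
            accepted.append(candidate)
            if len(accepted) == n_sequences:
                break
    return accepted

# Stand-in sampler: uniform random residues instead of RNN output.
rng = random.Random(0)

def random_sampler(seq_len):
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(seq_len))

def toy_filter(seq, diwv):
    # Illustrative filter: candidate must contain at least one serine.
    return "S" in seq

candidates = run_rnn(5, 30, {}, toy_filter, random_sampler)
```

Separating sampling from filtering is what makes the parameters interchangeable: the number of requested sequences, their length, and the filter set can each vary without touching the model itself.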
These three functions work together to generate the output sequences. We hope that by implementing this package in their workflows, end users will find it useful for sequence optimization and improvement.