The relationship between a promoter's sequence and its strength is a problem that scientists have long wanted to resolve. At present, the most reliable approach is still experimental verification, in which a reporter is used to characterize the strength of the promoter. Such experiments are time-consuming, however, and measuring the strength of promoter variants may even require gene synthesis, which makes them poorly suited to high-throughput work.
We therefore wanted software that can predict promoter strength directly from the sequence, which would greatly reduce the workload and experimental difficulty and make biological research more convenient.
We are well aware that this is very difficult. Given the complexity and the explosion of information that result from genetic diversity, the relationship between sequence and strength cannot easily be described in a simple linear fashion. Machine learning, however, has developed rapidly in recent decades and is well suited to non-linear problems. Its many algorithms have even made simultaneous translation possible through text recognition, a task that was previously unimaginable.
Figure 1. RNN working schematic diagram
A recurrent neural network (RNN) is a machine learning algorithm suited to learning from text (Fig. 1). In short, an RNN has a form of “memory”, which suits sequences in which earlier and later elements are correlated. When reading each character of the input sequence, the RNN takes into account the implicit information carried over from the characters it has already read. This makes RNNs very effective for data with sequence characteristics, because they can mine both timing information and semantic information.
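The “memory” described above can be sketched as a single recurrence that is applied once per character. The example below is only an illustration in plain Python, not the code from our repository: the one-hot encoding, the hidden size of 3, the random untrained weights, and the function names are all assumptions made for the example.

```python
import math
import random

BASES = "ACGT"


def one_hot(base):
    """Encode one nucleotide as a 4-element one-hot vector."""
    return [1.0 if base == b else 0.0 for b in BASES]


def rnn_step(x, h, w_xh, w_hh, bias):
    """One recurrence: h_new = tanh(W_xh * x + W_hh * h + bias)."""
    h_new = []
    for j in range(len(h)):
        total = bias[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h[k] for k in range(len(h)))
        h_new.append(math.tanh(total))
    return h_new


def encode_sequence(seq, w_xh, w_hh, bias):
    """Read the sequence base by base and return the final hidden state."""
    h = [0.0] * len(bias)
    for base in seq:
        # The previous hidden state h is fed back in: this is the "memory".
        h = rnn_step(one_hot(base), h, w_xh, w_hh, bias)
    return h


# Random, untrained weights, used for illustration only.
rng = random.Random(0)
HIDDEN = 3
w_xh = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(HIDDEN)]
w_hh = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
bias = [0.0] * HIDDEN

# The two sequences differ only in their first base.
h_a = encode_sequence("ACGT", w_xh, w_hh, bias)
h_b = encode_sequence("TCGT", w_xh, w_hh, bias)
print(h_a)
print(h_b)
```

Although both sequences end with the same three bases, their final hidden states differ, because the first base is still reflected in the state that is carried forward. In a real model, a final layer would map this hidden state to a predicted promoter strength.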
Training an RNN is a form of supervised learning, so during training each input must have a known y value as its label, against which the loss is calculated and stochastic gradient descent is performed. Because machine learning requires large amounts of data for model training, large quantities of characterization data are usually obtained from high-throughput laboratory experiments. We could not find a dataset large enough to meet the training requirements in our project or in previous experiments in our laboratory, so we selected data from the literature that can be obtained directly from the NCBI. The datasets in that research are divided into three groups according to the substrate used, and we selected the available dataset with glucose as the substrate, the same substrate used in our laboratory.
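The supervised training step described above (a labelled example, a loss value, and a gradient update) can be sketched as follows. To keep the arithmetic visible, this sketch swaps the RNN for a small linear model and uses synthetic labelled data; the squared-error loss, the function names, and the learning rate of 0.05 are assumptions for the example, not a description of our training code.

```python
import random


def predict(weights, bias, features):
    """Linear prediction: sum(w_i * x_i) + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def sgd_step(weights, bias, features, target, learning_rate):
    """One stochastic gradient descent update on the squared error of one example."""
    error = predict(weights, bias, features) - target
    loss = error ** 2
    # d(loss)/d(w_i) = 2 * error * x_i and d(loss)/d(bias) = 2 * error
    new_weights = [w - learning_rate * 2 * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * 2 * error
    return new_weights, new_bias, loss


# Synthetic labelled data (not real measurements): y = 2*x0 - 1*x1 + 0.5
rng = random.Random(0)
data = []
for _ in range(200):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, 2.0 * x[0] - 1.0 * x[1] + 0.5))

weights, bias = [0.0, 0.0], 0.0
for epoch in range(50):
    rng.shuffle(data)  # "stochastic": examples are visited in random order
    for features, target in data:
        weights, bias, loss = sgd_step(weights, bias, features, target, learning_rate=0.05)

print(weights, bias)
```

Each update needs the true y value of the example, which is why every promoter sequence in the training set must come with a measured strength.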
Due to time constraints, we were unable to tune the hyperparameters completely and carefully to optimize the results. The dataset we used was extremely large: 967,285 promoter variants in total, each 110 bp in length, which made for a very long training period. The original authors used a two-reporter homologous gene system expressing a constitutive red fluorescent protein (RFP) and a variable yellow fluorescent protein (YFP), and used flow cytometry to measure log(YFP/RFP) as a quantitative characterization of each promoter. This is how the promoter library was obtained.
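The log(YFP/RFP) value described above serves as the y label for each promoter. A minimal sketch of that calculation is shown below. The fluorescence readings are made-up numbers rather than values from the dataset, the function name is hypothetical, and the use of the natural logarithm is an assumption, since the base is not stated here.

```python
import math


def expression_label(yfp, rfp):
    """Return log(YFP/RFP); the natural logarithm is an assumption."""
    if yfp <= 0 or rfp <= 0:
        raise ValueError("fluorescence values must be positive")
    return math.log(yfp / rfp)


# Hypothetical fluorescence readings, not values from the dataset.
readings = {
    "weak": (50.0, 1000.0),
    "matched": (1000.0, 1000.0),
    "strong": (8000.0, 1000.0),
}
labels = {name: expression_label(yfp, rfp) for name, (yfp, rfp) in readings.items()}

for name, value in labels.items():
    print(f"{name}: {value:.3f}")
```

Because RFP is expressed constitutively, dividing by it acts as an internal reference, so the label reflects the promoter driving YFP rather than differences between individual cells. A promoter whose YFP signal equals the RFP signal gets a label of 0, weaker promoters get negative labels, and stronger ones get positive labels.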
Figure 1. learning_rate=0.05
Figure 2. learning_rate=0.005
Figure 3. learning_rate=0.0005
Judging by the loss values, the training results were not as good as we expected. We made several parameter adjustments, none of which changed the result significantly. After analysis, we believe that the loss does decline to a certain degree, but that the complexity of the genes prevents the RNN from fully learning the underlying information, so the loss fluctuates repeatedly.
We tested learning rates of 0.5, 0.1, 0.05, 0.005, 0.0005, and 0.00005, and only at 0.005 did the loss curve show a relatively clear downward trend. We also tested the number of hidden units: 256 gave a better result than 128 or 512. After running through the whole dataset while continuously computing the accuracy, we found that once about half of the data had been consumed, the model stopped learning anything new. We therefore retrained the model on half of the dataset and saved the complete model parameters as a '.pt' file for future teams to use, which can also be found in our GitHub repository.
Figure 4. Training set accuracy
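To illustrate why the learning rate was swept over several orders of magnitude, the sketch below runs plain gradient descent on a toy one-parameter loss with the same six learning rates. The toy loss, its curvature, and the step count are assumptions chosen for the example. It is not our RNN and does not reproduce our finding that only 0.005 worked; it only shows the general trade-off.

```python
def final_loss(learning_rate, steps=100, start=1.0, curvature=5.0):
    """Run gradient descent on loss(w) = curvature * w**2 and return the final loss."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * curvature * w
        w -= learning_rate * gradient
    return curvature * w * w


# The same learning rates that were tested for the RNN.
rates = [0.5, 0.1, 0.05, 0.005, 0.0005, 0.00005]
results = {rate: final_loss(rate) for rate in rates}

initial_loss = 5.0  # curvature * start**2
for rate in rates:
    print(f"learning_rate={rate:<8} final loss={results[rate]:.3e}")
```

On this toy problem the largest rate overshoots and the loss grows, while the smallest rates barely move the loss within the available steps. Which rate works best depends on the model and the data, which is why the value has to be found by testing.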
You can get our complete data and machine learning code from our GitHub repository for reproducible experimentation and optimization. The hyperparameters are defined at the beginning of the code, where you can tune them, and you can also change the structure of the neural network. We did not have enough time to optimize in the last month because of the amount of training involved, but subsequent teams can keep experimenting and exploring.
For the convenience of users, only a Python environment is needed for RNN training.
Besides using RNNs alone, researchers have also tried combining RNNs, CNNs, GANs, and other neural networks. Such combinations could also be tried to construct more complex neural networks that learn deeper and more obscure effects.
L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989, doi: 10.1109/5.18626.
Xi, L., Fondufe-Mittendorf, Y., Xia, L. et al. Predicting nucleosome positioning using a duration Hidden Markov Model. BMC Bioinformatics 11, 346 (2010). https://doi.org/10.1186/1471-2105-11-346
Curran, K., Crook, N., Karim, A. et al. Design of synthetic yeast promoters via tuning of nucleosome architecture. Nat Commun 5, 4002 (2014). https://doi.org/10.1038/ncomms5002
de Boer, C.G., Vaishnav, E.D., Sadeh, R. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol 38, 56–65 (2020). https://doi.org/10.1038/s41587-019-0315-8