Currently, the cleavage sites of Nattokinase remain unelucidated, and the main goal of this project is to predict them. Past research on protease-substrate relationships relies heavily on folding structures; however, learning features from folding structures is computationally expensive.

Our contribution is twofold: first, we establish a novel model and design an algorithm for the S8-family protease-substrate relationship using only sequence data and protease properties; second, we leverage our method to predict the cleavage sites of Nattokinase.

Our model performs well at predicting S8-family protease-substrate relationships while significantly reducing computational cost. In addition, it provides potential cleavage sites of Nattokinase that are worth experimental verification. Last but not least, anyone can access our code and application to predict the cleavage sites of S8-family proteases. Our project is not only an efficient model but also a precise analysis of the protease-substrate relationship.
Currently, the cleavage sites of Nattokinase remain unelucidated. Clinical research indicates that Nattokinase may have multiple pharmacologic actions, such as antithrombotic, antihypertensive, anti-atherosclerotic, lipid-lowering, and neuroprotective effects [1]. This implies the possible existence of a set of human proteins that serve as substrates of Nattokinase and give rise to the aforementioned pharmacologic actions. Our motivation is to identify the cleavage sites of Nattokinase and the possible substrates that interact with it.

Since Nattokinase is an S8-family protease, we surveyed numerous studies on predicting the cleavage sites of S8-family proteases. Most of them conduct folding-structure analyses. Although several solutions for folding-structure analysis exist, their computational cost is too high for common computing devices [2].

Our project provides a model for finding hidden features in amino acid sequences that represent critical structures in a protease. We exploit the S8-family protease sequence data in the MEROPS database [3]. We present a feature selection algorithm for proteases called “Amino Acid Distance Pair” (AADP). This algorithm simplifies the sequences while still representing hidden features of the folding structures. We also present an improved mode of AADP, called “Improved AADP” (IAADP), which performs much better than AADP. We extend the sequence data of cleavage sites to numerical data using the Amino Acid Index database (AAindex) [4]. Among the many available classification models, such as SVMs, regression models, and various neural networks, we choose a simple linear SVM and a dense neural network to classify whether a specific protease can cleave a given site.

Our contribution is twofold: first, we establish a novel model and design an algorithm for the S8-family protease-substrate relationship using only sequence data and protease properties, thereby circumventing the expensive computational cost of analyzing folding structures; second, we leverage our method to predict potential cleavage sites of Nattokinase in fibrin that are worth experimental verification. Last but not least, we develop an on-site application on Google Colab so that everyone can use it to predict the cleavage sites of S8-family proteases [5, 6, 8].
Materials and Methods
Dataset Preparation
The training dataset we used is from MEROPS [3]. We selected the proteases and their cleavage sites in the S8 family, covering 50 different proteases. After removing duplicate sequences and sequences with missing values, we obtained 743 cleavage sites and 1504 protease sequences. The features we use in our project are the peptidase unit, the active site, and the amino acid sequence. The data is stored in CSV format, which we introduce in the next paragraph.

Feature Selection
We present a feature selection algorithm for proteases called “Amino Acid Distance Pair” (AADP). We transform the sequence data into numerical data by counting the ordered pairs of all combinations of amino acids and their distances in the sequence (Fig 1). Taking Fig 1 as an example, we count every pair in the sequence: the pair “AL” with distance one is counted twice, and “LA” with distance two is counted once (note that pairs are ordered), and so on. The idea behind this algorithm is that specific amino acids form the features of folding structures (e.g., serine proteases). Thus, we count the occurrences to find the critical amino acid combinations in the sequence.
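The counting step can be sketched in a few lines of Python. This is a minimal sketch: the distance cutoff `max_distance` and the example sequence "ALHAL" are our own assumptions chosen to match the counting rule described above, not taken directly from Fig 1.

```python
from collections import Counter

def aadp(sequence, max_distance=3):
    """Count every ordered amino acid pair (a, b) together with its
    distance d = j - i in the sequence. The distance cutoff is an
    assumed parameter; the text does not state one."""
    counts = Counter()
    for i in range(len(sequence)):
        for d in range(1, max_distance + 1):
            j = i + d
            if j >= len(sequence):
                break
            counts[(sequence[i], sequence[j], d)] += 1
    return counts

# Hypothetical sequence illustrating the counting rule described above.
features = aadp("ALHAL")
print(features[("A", "L", 1)])  # the ordered pair "AL" at distance one occurs twice
print(features[("L", "A", 2)])  # "LA" at distance two occurs once
```

Because pairs are ordered, ("A", "L", 1) and ("L", "A", 1) are distinct features.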

However, the actual case is more complex. Biologically, only a few amino acids in the sequence play a crucial role, so not all amino acid pairs are equally important. As mentioned above, we therefore introduce the “active site” and “peptidase unit” into our algorithm. These two pieces of data tell us which amino acids are crucial and where the reaction actually happens, respectively. Hence, we only consider pairs that involve the active sites and lie within the range of the peptidase unit. This approach reduces the data size and gives the data a meaningful representation from a biological viewpoint. We call this algorithm “Improved AADP” (IAADP). Taking Fig 1 as an example again, assume that the amino acid at the third position, “H”, is one of the active sites. In IAADP, we only consider the pairs in which “H” is involved. We discuss these two algorithms in more detail later.
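A minimal sketch of the IAADP restriction follows, assuming 0-based positions and treating the peptidase unit as a half-open index range; both conventions are our own reading of the text, not stated in it.

```python
from collections import Counter

def iaadp(sequence, active_sites, unit_range, max_distance=3):
    """Like AADP, but keep only pairs that (a) lie inside the
    peptidase unit and (b) involve at least one active-site position."""
    start, end = unit_range
    counts = Counter()
    for i in range(start, end):
        for d in range(1, max_distance + 1):
            j = i + d
            if j >= end:
                break
            if i in active_sites or j in active_sites:
                counts[(sequence[i], sequence[j], d)] += 1
    return counts

# Fig 1 example: "H" (0-based position 2) is assumed to be an active
# site, so only pairs touching that position are counted.
features = iaadp("ALHAL", active_sites={2}, unit_range=(0, 5))
print(sorted(features))  # only pairs involving "H" remain
```

Pairs such as ("A", "L", 1), which do not touch an active site, are dropped entirely, which is how the feature space shrinks.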

The data for a cleavage site is a character sequence of amino acids with length 8. The character sequence by itself cannot express any properties of the cleavage site. Thus, we introduce the Amino Acid Index database (AAindex) [4], which contains 531 different numerical biochemical properties for each amino acid. Therefore, we extend every amino acid in the cleavage site to a vector with 531 components, and every cleavage site becomes a matrix of size 531 * 8.
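The encoding step can be sketched as follows. The property values below are placeholders: the real AAindex table has 531 properties per residue, while here we use 2 invented values per amino acid purely for brevity.

```python
import numpy as np

# Placeholder stand-in for the AAindex table: real entries have 531
# numerical properties per amino acid; these two values are invented.
AAINDEX = {
    "A": [0.62, 8.1], "L": [1.06, 4.9], "H": [-0.40, 10.4],
    "G": [0.48, 9.0], "S": [-0.18, 9.2], "K": [-1.50, 11.3],
    "R": [-2.53, 10.5], "V": [1.08, 5.9],
}

def encode_cleavage_site(site):
    """Expand an 8-residue cleavage site into a (properties x 8) matrix."""
    assert len(site) == 8
    return np.array([AAINDEX[aa] for aa in site]).T

matrix = encode_cleavage_site("ALHGSKRV")
print(matrix.shape)  # (2, 8); with the full AAindex this would be (531, 8)
```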

The final input combines each protease's features with a cleavage-site matrix, and the label indicates whether this is a positive protease-substrate relationship (1 means yes, 0 otherwise).
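Assembling one training example can then be sketched as follows; the exact concatenation order is our assumption, as the text does not specify it.

```python
import numpy as np

def make_example(protease_features, site_matrix, label):
    """Concatenate the protease feature vector with the flattened
    531 x 8 cleavage-site matrix; label 1 = positive relationship."""
    x = np.concatenate([np.asarray(protease_features, dtype=float),
                        site_matrix.ravel()])
    return x, label

x, y = make_example([3.0, 1.0], np.zeros((531, 8)), 1)
print(x.shape)  # (4250,) = 2 protease features + 531 * 8
```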

Fig 1. AADP example
Model Design and Experiment
Past research on cleavage site prediction focused on a single specific protease. Methodologically, it concentrated on feature selection plus simple classification models, e.g., SVM [7]. Although we analyze a whole family of proteases (the S8 family), we still consider this kind of simple model. Neural networks have recently shown a superb ability to fit data and extract features, so we experiment with them as well.

For the SVM, we use a linear kernel and run 10-fold validation during the experiment. For the neural network (NN), we use two dense layers (Fig 2), set the step size to 0.0001, run 250 epochs per experiment, and use a recall-precision-based loss function.
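The text does not specify the loss beyond "recall-precision-based"; one plausible sketch, not necessarily the one used in the experiments, is a differentiable 1 - soft-F1 loss computed from probabilistic TP/FP/FN counts:

```python
import numpy as np

def soft_f1_loss(y_true, y_prob, eps=1e-7):
    """1 - soft F1: a recall-precision-based loss sketch. y_true holds
    0/1 labels, y_prob holds predicted probabilities; TP/FP/FN are
    accumulated as soft (probabilistic) counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    tp = np.sum(y_true * y_prob)
    fp = np.sum((1 - y_true) * y_prob)
    fn = np.sum(y_true * (1 - y_prob))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 1.0 - 2 * precision * recall / (precision + recall + eps)

print(soft_f1_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # small for good predictions
```

Optimizing such a loss pushes the network toward high precision and recall jointly, rather than toward raw accuracy.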

Our dataset contains an overwhelming number of negative labels. However, we hope our model truly learns the hidden features in the data instead of guessing the answer from the distribution of training labels. The true distribution in our dataset is [1066224:51248] (about 4% positive labels), whereas the training dataset we design is [15040:7834] (about 30% positive labels), which is far from the true distribution. During testing, we use the whole dataset as the test set.
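The subsampling described above can be sketched as follows; the text does not state how the subsets were drawn, so uniform random sampling is our assumption.

```python
import random

def make_training_split(positives, negatives, n_pos, n_neg, seed=0):
    """Draw a training set with a label ratio far from the ~4%-positive
    true distribution, e.g. n_neg=15040, n_pos=7834 as in the text."""
    rng = random.Random(seed)
    train = ([(x, 1) for x in rng.sample(positives, n_pos)] +
             [(x, 0) for x in rng.sample(negatives, n_neg)])
    rng.shuffle(train)
    return train

# Toy illustration with small counts in place of the real 51248/1066224.
split = make_training_split(list(range(100)), list(range(1000)), 30, 70)
print(len(split))  # 100 examples, 30% positive
```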
Fig 2. NN model summary
In the following section, we discuss the performance of our model. For simplicity, we use TP for true positives, TN for true negatives, FP for false positives, and FN for false negatives. We need convincing metrics that evaluate our model objectively, so we focus on the recall and precision of our model rather than accuracy; we explain why in the following paragraph.

Experiments on protease-substrate relationships are very expensive to conduct, so we would rather not test every pair of protease and protein. Given a prediction model, our target is to maximize the percentage of TP among all the positive predictions of our model: we can then run experiments on only our positive predictions and obtain many confirmed positive protease-substrate relationships. This metric is the precision of the confusion matrix, so precision is a realistic goal to maximize. Moreover, we also want to find all the positive protease-substrate relationships, so we additionally care about the percentage of TP among all the actually positive data. This metric is the recall of the confusion matrix, so recall is also critical to our model. We do not care much about TN because it is not what we want to discover, so accuracy is a less significant metric.
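These metrics follow directly from the confusion matrix; the standard computation is:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# F1 can also be computed from recall and precision alone, e.g. the
# SVM-with-AADP numbers reported in the Results section:
recall, precision = 0.316, 0.370
print(round(2 * precision * recall / (precision + recall), 3))  # 0.341
```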

Results of AADP
The results of AADP are shown in Fig 3: the left panel is the result of SVM, and the right one is NN. The average SVM accuracy of k-fold validation during training is 0.889 +/- 0.008. The most important metrics are recall and precision, which we compute from the confusion matrix (Fig 3). The recall and precision of our SVM model using AADP are 0.316 and 0.370, respectively, and the accuracy is 0.944. In addition, we calculate the F1 score from recall and precision, which weighs the two equally. The F1 score for SVM is 0.341.

Next, we show the learning curves of the NN and discuss its performance in the AADP case. Since we do not care about accuracy, we do not show its learning curve; we only show the loss and precision curves (Fig 4). For AADP, the recall and precision of the NN model are 0.057 and 0.036, respectively, which is very poor performance; the F1 score is 0.044. The accuracy is still 0.887, so we suspect the model is predicting false for almost every test case, and the large number of TN inflates the apparent performance.
Fig 3. AADP confusion matrices of SVM (left) and NN (right).
Fig 4. Learning curves of our NN using AADP (left: loss curve; right: precision curve).
Results of IAADP
The results of IAADP are shown as confusion matrices in Fig 5: the left one is SVM, and the right one is NN. We run each experiment five times to obtain the average performance of our model.

First, we discuss the performance of SVM. The k-fold validation accuracy during training is about 0.784 +/- 0.009. We obtain our performance metrics from the confusion matrix (Fig 5). The recall of our model is 0.188 and the precision is 0.070; the F1 score in this case is 0.102. We also calculate the accuracy of our model, which is 0.849.

Next, we discuss the performance of our NN model. Since accuracy is less significant and the accuracy curve of our model converges to almost 100%, we do not show it here. The learning curves are shown in Fig 6. The loss and precision curves almost converge to fixed values, but recall keeps oscillating. We ran the experiment with more epochs, but recall kept oscillating around 0.6, so we use 250 epochs as our final setting. As above, we care most about recall and precision. The recall of our model is 0.874 and the precision is 0.406. The F1 score is 0.554, the best performance in our project. In addition, the accuracy of the NN in the final test session is 0.936.

Fig 5. IAADP confusion matrices of SVM (left) and NN (right).
Fig 6. Learning curves of our NN using IAADP (left: loss curve; center: recall curve; right: precision curve).
Last, we compare and discuss the results of the two algorithms and the different classification models. Table 1 summarizes the performance of every model and experiment. With the AADP algorithm, SVM obtains higher recall and precision while NN performs poorly. We think this makes sense to a certain extent: we suspect the NN tries to find patterns in AADP, but since this algorithm does not follow the principles of biochemical reactions, it may learn spurious patterns that only occur in the training dataset and thus underperform. In contrast, IAADP shows the opposite behavior. We reckon that the SVM lacks the capacity to generalize the correct feature representation in the data, whereas with the NN we obtain the highest recall and precision in all our experiments, which will not only reduce experimental cost but also find a great number of actual positive relationships. Comparing SVM and NN overall: with AADP, neither has a strong ability to classify protease-substrate relationships; with IAADP, however, the NN clearly outperforms the SVM. We conclude that the non-linear model (NN) generalizes better, and the linear SVM is not powerful enough to achieve this outstanding performance.
Table 1. The performance summary of our experiment
This project provides a novel model for predicting protease-substrate relationships that uses only sequence data and some protease properties. Since it only needs to analyze sequence data, it has a significantly lower cost than previous methods. While our model also achieves high accuracy, more importantly it has outstanding recall; in other words, it can find a large percentage of the actual protease-substrate relationships. Moreover, it has good precision, so after our prediction we can not only skip a considerable number of experiments but also expect many successful results.

With regard to Nattokinase, we conducted an interesting experiment on its relationship to fibrin: we used our model to predict cleavage sites on fibrin that are worth experimental verification.

There are several compelling directions for further study. The first is to try more classification models, e.g., various neural networks and kernel SVMs. The second is to combine amino acid properties with the protease sequence, though this may require exceptional feature selection techniques. The third is to run more experiments on benchmark data to prove the robustness of our model. The last and most important is to generalize our model to broader protease families, making it more realistic and applicable.
[1] Chen, H., McGowan, E. M., Ren, N., Lal, S., Nassif, N., Shad-Kaneez, F., Qu, X., & Lin, Y. (2018). Nattokinase: A Promising Alternative in Prevention and Treatment of Cardiovascular Diseases. Biomarker insights, 13, 1177271918785130.

[2] Si, D., Moritz, S.A., Pfab, J. et al. Deep Learning to Predict Protein Backbone Structure from High-Resolution Cryo-EM Density Maps. Sci Rep 10, 4282 (2020).

[3] Summary for family S8 - MEROPS - the Peptidase Database.

[4] Superzchen. AAindex data in iFeature github repository.

[5] NYCU-Taipei software repository.

[6] The application of our project on Google Colab.

[7] Li, F., Wang, Y., Li, C., Marquez-Lago, T. T., Leier, A., Rawlings, N. D., Haffari, G., Revote, J., Akutsu, T., Chou, K. C., Purcell, A. W., Pike, R. N., Webb, G. I., Ian Smith, A., Lithgow, T., Daly, R. J., Whisstock, J. C., & Song, J. (2019). Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods. Briefings in bioinformatics, 20(6), 2150–2166.

[8] GitHub Judging Release.
Authored and maintained by Team NYCU-Taipei 2021.