Dataset Preparation
The training dataset we used is from MEROPS [3]. We selected the proteases and their cleavage sites in the S8 family, covering 50 different proteases. After removing duplicate sequences and sequences with missing values, we obtained 743 cleavage sites and 1504 protease sequences. The features we use in our project are the peptidase unit, the active site, and the amino acid sequence. The data is stored in CSV format; we introduce the format in the next paragraph.
Feature Selection
We present a feature selection algorithm for proteases called “Amino Acid Distance Pair” (AADP). We transform the sequence data into numerical data by counting, for every ordered pair of amino acids, the number of times it occurs at each distance in the sequence (Fig 1). Taking Fig 1 as an example, we count every pair in the sequence: the pair “AL” at distance one occurs twice, and “LA” at distance two occurs once (note that the pairs are ordered), and so on. The idea behind this algorithm is that specific amino acids form the features of folding structures (e.g., in serine proteases). Thus, we count the number of occurrences to find the critical amino acid combinations in the sequence.
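The pair counting can be sketched as follows. This is a minimal illustration, not the authors' implementation; the sequence "ALHAL" is a hypothetical stand-in chosen to reproduce the counts described for Fig 1, and the maximum distance considered is an assumed parameter.

```python
from collections import Counter

def aadp_features(seq, max_dist=2):
    """AADP sketch: count ordered amino-acid pairs at each distance.

    Returns a Counter keyed by (first_aa, second_aa, distance).
    """
    counts = Counter()
    for d in range(1, max_dist + 1):          # each distance up to max_dist
        for i in range(len(seq) - d):         # each ordered pair at distance d
            counts[(seq[i], seq[i + d], d)] += 1
    return counts

# Hypothetical sequence: "AL" at distance 1 occurs twice,
# "LA" at distance 2 occurs once, matching the Fig 1 description.
feats = aadp_features("ALHAL", max_dist=2)
```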
However, the actual case is more complex. Biologically, only a few amino acids in a sequence play a crucial role, so not all amino acid pairs can be equally important. As mentioned above, we therefore introduce the “active site” and “peptidase unit” annotations into our algorithm. These two annotations tell us which amino acids are crucial and where the reaction actually happens, respectively. Hence, we only count pairs that involve an active-site residue and lie within the range of the peptidase unit. This approach reduces the data size and gives the data a better representation from a biological viewpoint. We call this algorithm “Improved AADP” (IAADP). Taking Fig 1 as an example again, assume that the amino acid at the third position, “H”, is one of the active sites. In IAADP, we only count the pairs that “H” is involved in. We discuss these two algorithms further below.
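The IAADP restriction can be sketched as a filter over the same pair counting. Again a minimal illustration under the same hypothetical sequence; the `active_sites` and `pep_unit` arguments stand in for the MEROPS annotations and their exact representation is an assumption.

```python
from collections import Counter

def iaadp_features(seq, active_sites, pep_unit, max_dist=2):
    """IAADP sketch: count ordered pairs within the peptidase unit
    where at least one residue is an annotated active site.

    active_sites: set of 0-based positions; pep_unit: (start, end), inclusive.
    """
    start, end = pep_unit
    counts = Counter()
    for d in range(1, max_dist + 1):
        for i in range(start, end - d + 1):
            j = i + d
            # keep the pair only if an active-site residue is involved
            if i in active_sites or j in active_sites:
                counts[(seq[i], seq[j], d)] += 1
    return counts

# Fig 1 example again: "H" at the third position (0-based index 2)
# is the active site, so only pairs touching it are counted.
feats = iaadp_features("ALHAL", active_sites={2}, pep_unit=(0, 4))
```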
Each cleavage site is a character sequence of amino acids with length 8. The raw character sequence cannot express any biochemical properties of the cleavage site, so we introduce the Amino Acid Index database (AAIndex) [4], which contains 531 different numerical biochemical properties for each amino acid. We therefore expand every amino acid in the cleavage site into a vector with 531 components, so every cleavage site becomes a matrix of size 531 * 8.
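The encoding step can be sketched as below. The toy property table uses only 3 illustrative values per amino acid (the first is the Kyte-Doolittle hydropathy; the others are made up for shape); the real AAIndex supplies 531 properties, giving a 531 * 8 matrix.

```python
import numpy as np

# Toy stand-in for AAIndex: 3 illustrative properties per amino acid
# instead of the full 531. Only the residues used below are listed.
AAINDEX = {
    "A": [1.8, 0.62, 89.1],
    "L": [3.8, 0.53, 131.2],
    "G": [-0.4, 0.88, 75.1],
    # ... remaining amino acids in the full table
}

def encode_cleavage_site(site):
    """Expand an 8-residue cleavage site into an (n_properties, 8) matrix."""
    cols = [AAINDEX[aa] for aa in site]   # one property vector per residue
    return np.array(cols).T               # properties as rows, positions as columns

m = encode_cleavage_site("ALGALGAL")
# m.shape is (3, 8) here; with the full AAIndex it would be (531, 8)
```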
Each final input example combines a protease feature vector with a cleavage-site matrix, and the label is 0 or 1 indicating whether this is a positive protease-substrate relationship (1 means yes, 0 otherwise).
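One simple way to realize this combination, assuming the protease features are a flat vector and the cleavage-site matrix is flattened before concatenation (the paper does not specify the exact joining scheme):

```python
import numpy as np

def make_example(protease_vec, site_matrix, label):
    """One training row: protease features concatenated with the
    flattened cleavage-site matrix, paired with its 0/1 label."""
    x = np.concatenate([protease_vec, site_matrix.ravel()])
    return x, label

# Hypothetical shapes: 4 protease features + a 2x3 site matrix -> 10 inputs.
x, y = make_example(np.ones(4), np.zeros((2, 3)), 1)
```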
Fig 1. AADP example
Model Design and Experiment
Past research on cleavage site prediction focused on a single specific protease, and the methods concentrated on feature selection combined with simple classification models, e.g., SVM [7]. Although we analyze a whole family of proteases (the S8 family), we still consider this kind of simple model. Neural networks have recently shown a superb ability to fit data and extract features, so we experiment with them as well.
For the SVM, we use a linear kernel and run 10-fold cross-validation in each experiment. For the neural network (NN), we use two dense layers (Fig 2), set the step size to 0.0001, run 250 epochs per experiment, and use a recall-precision-based loss function.
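The paper does not give the exact form of the recall-precision-based loss; one common realization is a "soft F1" loss, where predicted probabilities stand in for hard 0/1 predictions so the loss is differentiable. The sketch below is that assumed form, in NumPy for clarity:

```python
import numpy as np

def soft_f1_loss(y_true, y_prob, eps=1e-7):
    """Assumed recall-precision-based loss: 1 minus a differentiable
    'soft' F1 score computed from predicted probabilities."""
    tp = np.sum(y_prob * y_true)            # soft true positives
    fp = np.sum(y_prob * (1 - y_true))      # soft false positives
    fn = np.sum((1 - y_prob) * y_true)      # soft false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return 1.0 - f1

# Confident correct predictions drive the loss toward 0.
loss = soft_f1_loss(np.array([1, 0, 1]), np.array([0.99, 0.01, 0.98]))
```

Because the loss is built from precision and recall rather than raw accuracy, it penalizes a model that ignores the rare positive class, which suits the imbalance described next.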
Our dataset contains overwhelmingly many negative labels: the true distribution is [1066224:51248], i.e., only about 4% positive labels. We want our model to really learn the hidden features in the data instead of guessing the answer from the distribution of the training labels, so we design a training dataset of [15040:7834] (about 30% positive labels), which is far from the true distribution. During testing, we use the whole dataset as the test set.
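Constructing such a rebalanced training set amounts to sampling a fixed count from each class, e.g. 15040 negatives and 7834 positives. A minimal sketch (the sampling strategy is assumed to be uniform without replacement, which the paper does not state):

```python
import numpy as np

def rebalance(X, y, n_neg, n_pos, seed=0):
    """Build a training set with a chosen negative:positive count
    by sampling without replacement from each class."""
    rng = np.random.default_rng(seed)
    neg_idx = np.flatnonzero(y == 0)
    pos_idx = np.flatnonzero(y == 1)
    keep = np.concatenate([rng.choice(neg_idx, n_neg, replace=False),
                           rng.choice(pos_idx, n_pos, replace=False)])
    rng.shuffle(keep)                        # mix the classes together
    return X[keep], y[keep]

# Tiny demo: 80 negatives and 20 positives, downsampled to 30:10.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)
X_tr, y_tr = rebalance(X_demo, y_demo, n_neg=30, n_pos=10)
```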
Fig 2. NN model summary