Introduction - How we got here

At the center of our project is the creation of artificial fusion proteins. By combining a Cytochrome P450 protein (CYP) with a Cytochrome P450 Reductase protein (CPR), we hope to create a self-sufficient fusion complex that is capable of transferring electrons in various ways. Similar fusion complexes exist in nature [1][2], and it has already been shown that new fusion complexes with similar properties can be engineered artificially, with the opportunity to modify their functionality as needed [3][4][5].

When creating such chimeric proteins, it is necessary to combine the amino acid sequences of both proteins. For the electron transfer between the two components to work, their reaction centers need to be at a specific distance from one another [6]. It is therefore not enough to simply append one sequence to the other. Instead, a linker sequence is inserted between the two sequences, allowing the proteins to reach a relative position and orientation that enables electron transfer. The success of this construct depends largely on the length and structure of the linker sequence, as these determine which relative positions and orientations are possible [6]. However, finding a working linker length and sequence is a challenging problem, since the desired complex structure of the two proteins is unknown. In this work, we attempt to derive possible linker sequences by harvesting linker sequences from known fusion proteins. This workflow is presented as an automated command-line tool named LEA (Linker Extraction from Alignments), which is based on Multiple Sequence Alignments. LEA is available under:

Pipeline - What we do

Figure 1: Workflow of LEA
The purpose of this tool is to propose linker sequences for novel fusion proteins that resemble already existing fusion proteins. We assume that the user provides two amino acid sequences A and B that they wish to combine, and a set of known fusion proteins that serve as reference sequences. These reference proteins should contain domains similar to the proteins that the user wants to fuse. For example, in our project we were interested in combining a CYP protein with a CPR protein to create a self-sufficient CYP-CPR protein complex. We therefore collected a dataset of all known self-sufficient CYP-CPR protein complexes from the UniProt database, since these proteins resemble the type of protein we wanted to create. Note that the individual reference proteins did not contain the exact combination of CYP and CPR protein that we intended to create. The workflow we propose is divided into three parts.

In the first part, we execute a BLAST search for each of the two query sequences to find a set of similar proteins for each (set A and set B). Next, the set of reference sequences is filtered through two more BLAST searches with the query sequences, keeping only the reference sequences that have a minimum similarity to both query sequences (set R).
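The filtering of the reference sequences can be sketched as follows. This is a minimal illustration, not LEA's actual code: it assumes that the two filtering searches were saved in BLAST's tabular format (`-outfmt 6`), where the second column holds the subject ID and the eleventh column the e-value. The function names are hypothetical.

```python
def hits_below_evalue(blast_tabular, max_evalue=0.01):
    """Return the subject IDs with at least one hit at or below max_evalue.

    Assumes BLAST tabular output (-outfmt 6) with the default columns.
    """
    hits = set()
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comment lines
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits.add(subject_id)
    return hits


def filter_references(hits_a, hits_b, max_evalue=0.01):
    """Keep only references that are similar to BOTH query sequences (set R)."""
    return hits_below_evalue(hits_a, max_evalue) & hits_below_evalue(hits_b, max_evalue)
```

A reference sequence that matches only one of the two queries is dropped, since it is unlikely to contain both domains.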

In the second part, these three sets are aligned in a Multiple Sequence Alignment (MSA) using the command-line tool MAFFT (Multiple Alignment using Fast Fourier Transform). The purpose of this step is to identify the linker areas in the reference sequences. Ideally, the MSA shows the following properties. First, the individual sets A, B and R should each align well internally. Second, set A should show a contiguous alignment with the front part of set R, and set B should align with the back part of set R. Finally, the areas of set R that align with set A and set B should be disjoint with a clear cut in the alignments, leaving an area in the middle of set R that aligns with neither A nor B (see Figure 2). This area can be interpreted as the linker sequences of the existing fusion proteins.

Figure 2: How we expect our MSA to look with our three sequence sets
In practice, however, the resulting MSA rarely fulfills these criteria without some adjustments. In many cases, there are gaps all over the alignment that are caused by a few reference sequences from set R that do not align well with the rest of the sequences. These few outliers can be removed from the MSA as a data cleanup step. Another common issue is that the aligning areas of sets A and B are not fully disjoint despite a clear cut in the alignment at the linker area. Instead, short suffixes of some sequences in set A align in the area of set B, and short prefixes of some sequences in set B align in the area of set A. In these cases, we conclude that the suffixes and prefixes correspond to structurally nonessential parts of the proteins that are neither present in a fusion protein nor important for its function, and should therefore be removed before joining the two sequences. An example of this case is shown in the Demonstration section.

The final part of the workflow consists of several steps to automatically identify and extract the linker sequences in set R, following the above line of thought. The approach in this part is to automatically detect the clear alignment cuts at the beginning and end of the linker area while ignoring the suffixes and prefixes that should be removed. Once the area is detected, the corresponding linker sequences can be extracted. In a final step, all possible linker sequences are returned to the user, ranked by the similarity of the corresponding reference sequence to the query proteins. Additionally, the tool provides the positions where the query sequences should be cut before joining them together.
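To illustrate how the reported cut positions and a proposed linker could be combined into a fusion sequence, consider the following minimal sketch. The function name and the index convention are assumptions made for this example: `cut_a` is taken as the number of residues kept from the start of sequence A, and `cut_b` as the number of residues dropped from the start of sequence B. LEA's own output convention may differ.

```python
def assemble_fusion(seq_a, seq_b, linker, cut_a, cut_b):
    """Join the trimmed query sequences with a linker in between.

    Keeps seq_a[:cut_a] (dropping the nonessential suffix of A) and
    seq_b[cut_b:] (dropping the nonessential prefix of B).
    """
    if not 0 <= cut_a <= len(seq_a) or not 0 <= cut_b <= len(seq_b):
        raise ValueError("cut position outside of sequence")
    return seq_a[:cut_a] + linker + seq_b[cut_b:]


# Toy sequences, not real proteins:
fusion = assemble_fusion("MKTAYQQ", "SSGHWLV", "PSPST", cut_a=5, cut_b=2)
print(fusion)
```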

Implementation - What’s under the hood

The presented workflow is implemented as a command-line tool written in Python. The code is available at our GitHub repository.

The BLAST searches in the first part of the workflow are executed using the REST Web API provided by the NCBI [7]. To filter the reference sequences by similarity to the two query sequences, we use the standalone version of BLAST [8].

The MSA calculation is performed by the tool MAFFT [9] using the --localpair option. This option favors an accurate local alignment around the linker area, which is important for precise linker detection.
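A call to MAFFT from Python might look like the sketch below. MAFFT spells the local pairwise option `--localpair` and writes the alignment to standard output; the helper names and file paths are hypothetical, and the sketch assumes that the sets A, B and R have already been written to a single FASTA file.

```python
import subprocess


def build_mafft_command(input_fasta, executable="mafft"):
    """Build the MAFFT command line for a local pairwise alignment."""
    return [executable, "--localpair", input_fasta]


def run_mafft(input_fasta, output_fasta):
    """Run MAFFT and store the alignment it prints to stdout."""
    with open(output_fasta, "w") as handle:
        subprocess.run(build_mafft_command(input_fasta), stdout=handle, check=True)
```

With `check=True`, a failing MAFFT run raises an exception instead of silently leaving an empty alignment file.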

The cleaning process removes gaps from the alignment that are caused by a single sequence or very few sequences, which enables the linker detection in the next step. We define a gap as a run of consecutive columns in the MSA in which most sequences contain only gap symbols (‘-’). The process is controlled by two parameters, w and n. To find the gaps, we iterate through all positions of the MSA from left to right and inspect the next w columns from each position. If all but n (or fewer) sequences contain only gap symbols in the current window, we remove the remaining sequences, that is, the n or fewer sequences that contain residues there. This leaves fully empty columns, which are removed after the scan through the MSA. Removing sequences and columns can in turn create new gaps, so we repeat the process until no new gaps are found and removed. With this procedure we ensure that no such gaps wider than w columns remain in the MSA.
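The cleaning procedure can be sketched as follows. This is one possible reading of the description rather than LEA's actual implementation: the MSA is assumed to be a dictionary mapping sequence names to aligned strings of equal length, a window is taken to be exactly w columns wide, and outliers found during one scan are removed together at the end of that scan.

```python
def drop_empty_columns(msa):
    """Remove all columns that contain only gap symbols."""
    if not msa:
        return {}
    names = list(msa)
    columns = zip(*(msa[name] for name in names))
    kept = [col for col in columns if any(ch != "-" for ch in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


def clean_msa(msa, w=5, n=5):
    """Repeatedly remove sequences that cause gaps, then the emptied columns."""
    msa = drop_empty_columns(msa)
    changed = True
    while changed and msa:
        changed = False
        length = len(next(iter(msa.values())))
        outliers = set()
        for start in range(length - w + 1):
            # Sequences with at least one residue in the current window.
            occupied = {name for name, seq in msa.items()
                        if seq[start:start + w].strip("-")}
            # A window filled by only a few sequences marks them as outliers.
            if 0 < len(occupied) <= n and len(occupied) < len(msa):
                outliers |= occupied
        if outliers:
            remaining = {k: v for k, v in msa.items() if k not in outliers}
            msa = drop_empty_columns(remaining)
            changed = True
    return msa
```

For a toy alignment in which a single sequence carries a five-residue insertion, calling `clean_msa(msa, w=5, n=1)` removes that sequence and closes the gap in the remaining ones.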

Next, we identify the linker area and extract the linker sequences. We define the linker area as the longest run of consecutive columns that consist mostly of gap symbols in both query sequence groups A and B. A small margin of error, controlled by the parameter p, is allowed to account for potential outliers in the MSA. To find the linker area, we iterate over the MSA again and mark every column that fulfills the requirement as a possible linker position: a column counts as a possible linker position if at least p percent of all query sequences of groups A and B show a gap symbol in that column. Consequently, at most (100 - p) percent of the query sequences are tolerated to align at this position. After the marking process, we identify the longest run of consecutive marked columns and interpret it as the linker area. Finally, the linker sequences are extracted by taking the part of each reference sequence in set R that lies inside the linker area and removing the gap symbols.
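The linker detection can be illustrated with the following minimal sketch. It assumes that the cleaned MSA is a dictionary mapping sequence names to aligned strings of equal length and that p is given as a fraction (0.95 for 95 percent). The function names are hypothetical and not taken from LEA.

```python
def find_linker_area(msa, query_names, p=0.95):
    """Return (start, end) of the longest run of linker columns, end exclusive.

    A column is a linker column if at least a fraction p of the query
    sequences (groups A and B together) show a gap symbol in it.
    """
    queries = [msa[name] for name in query_names]
    length = len(queries[0])
    best = (0, 0)
    run_start = None
    # One extra iteration past the last column closes a run at the MSA end.
    for col in range(length + 1):
        is_linker = (col < length and
                     sum(seq[col] == "-" for seq in queries) >= p * len(queries))
        if is_linker and run_start is None:
            run_start = col
        elif not is_linker and run_start is not None:
            if col - run_start > best[1] - best[0]:
                best = (run_start, col)
            run_start = None
    return best


def extract_linkers(msa, reference_names, area):
    """Cut the linker area out of each reference sequence and drop gaps."""
    start, end = area
    return {name: msa[name][start:end].replace("-", "") for name in reference_names}
```

Because the fraction is computed over groups A and B together, columns in which only one group aligns are not marked, as roughly half of the query sequences carry residues there.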

All parameters are adjustable, but we found the following default settings effective:

  • number of sequences per group: 100
  • maximal e-value for reference sequence filtering: 0.01
  • minimum number of sequences to protect gap from removal: 5
  • maximum gap width: 5
  • fraction of gap symbols per column to count as linker: 0.95 (i.e. 95 percent)

Demonstration - What it looks like

In our project, we started with the amino acid sequences of a CYP and a CPR (see GitHub repository). Next, we gathered a set of 113 sequences of known self-sufficient CYP-CPR fusion proteins from the UniProt database. We then performed the two BLAST searches for our two query sequences as described above, resulting in two sets of 100 CYP and 100 CPR sequences that were similar to our query sequences. Filtering the reference sequences with two further BLAST searches left a set of 64 reference sequences to start the alignment process. The result of the MSA is shown in Figure 3.

Figure 3: Raw Multiple Sequence Alignment before cleaning procedure

The image shows many small gaps within the alignment. On closer inspection, these gaps are mainly caused by very few of the reference sequences. After removing these outliers, the MSA looks a lot cleaner, as shown in Figure 4. Also visible in the image are the suffix and prefix areas of sets A and B described before. These should be removed before joining the two sequences at the end, which results in two completely disjoint alignment areas.

Figure 4: Multiple Sequence Alignment after cleanup
Finally, there is a clear cut both in set A and B that leaves a middle part of set R completely without alignment with set A and B. This is the desired linker area. All of the sequences in this area, beginning and ending at the clear cut positions, are examples of linker sequences found in known CYP-CPR fusion proteins.

Figure 5: Screenshot of the results file by LEA with the extracted linkers
To test our tool's performance, we executed LEA with another example derived from the results obtained above. For this, we took an arbitrary predicted linker sequence (in this case the sequence with the ID seq1 of CYP102A4 Bacillus anthracis str. Ames) and removed the corresponding sequence from our reference sequence file. We then executed LEA with one part of seq1 as sequence A and the other part as sequence B (split at the predicted linker sequence, which we removed).
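Preparing such a test case amounts to splitting the held-out fusion protein at its linker. The following is a minimal sketch with a hypothetical function name and toy sequences; it assumes the linker occurs as a contiguous substring of the full sequence and splits at its first occurrence.

```python
def split_at_linker(full_sequence, linker):
    """Split a fusion protein into the parts before and after its linker."""
    before, found, after = full_sequence.partition(linker)
    if not found:
        raise ValueError("linker not found in sequence")
    return before, after


# Toy example: the two parts serve as query sequences A and B.
seq_a, seq_b = split_at_linker("MKTAYPSPSTGHWLV", "PSPST")
```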

In the end we checked how similar the predicted linker sequences were to the actual linker sequences via BLASTp.

Figure 6: Screenshot of the BLAST search of the test linker sequence with the predicted linkers
Figure 7: Screenshot of the query coverage of the BLAST search of the test linker sequence with the newly predicted linkers
We can observe that even though the predicted linkers are not identical to the actual linker, all of the predicted sequences are fairly close to it and are returned as significant hits in the BLAST search. Although the amino acids vary in some regions of the linker, these differences may be irrelevant for the structure of the final linker and, most importantly, for the functionality of the fusion protein. This test case indicates that LEA also produces good results for other sequences. However, there is still a lot of room for improvement. For example, one could examine the resulting 3D structures. This would be possible with the recently released AlphaFold [10], which predicts a protein's 3D structure from its amino acid sequence, and would be a suitable way to check that our predicted fusion protein sequences produce structurally sensible results.

Supplemental data and the example files shown here can be found in our GitHub repository.


[1] Hong, M., Kim, R. N., Song, J. Y., Choi, S. J., Oh, E., Lira, M. E., ... & Choi, Y. L. (2014). HIP1–ALK, a novel fusion protein identified in lung adenocarcinoma. Journal of Thoracic Oncology, 9(3), 419-422. doi: 10.1097/JTO.0000000000000061

[2] Hoffmann, I., & Oliw, E. H. (2013). Discovery of a linoleate 9S-dioxygenase and an allene oxide synthase in a fusion protein of Fusarium oxysporum [S]. Journal of lipid research, 54(12), 3471-3480. doi: 10.1194/jlr.M044347

[3] Valiyari, S., Salimi, M., & Bouzari, S. (2020). Novel fusion protein NGR-sIL-24 for targetedly suppressing cancer cell growth via apoptosis. Cell biology and toxicology, 36(2), 179-193. doi: 10.1007/s10565-020-09519-3

[4] Yu, K., Liu, C., Kim, B. G., & Lee, D. Y. (2015). Synthetic fusion protein design and applications. Biotechnology advances, 33(1), 155-164. doi: 10.1016/j.biotechadv.2014.11.005

[5] Sadeghi, S. J., & Gilardi, G. (2013). Chimeric P 450 enzymes: Activity of artificial redox fusions driven by different reductases for biotechnological applications. Biotechnology and Applied Biochemistry, 60(1), 102-110. doi: 10.1002/bab.1086

[6] Shamriz, S., Ofoghi, H., & Moazami, N. (2016). Effect of linker length and residues on the structure and stability of a fusion protein with malaria vaccine application. Computers in biology and medicine, 76, 24-29. doi: 10.1016/j.compbiomed.2016.06.015

[7] BLAST Common URL API, Retrieved from

[8] Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC bioinformatics, 10(1), 1-9.

[9] Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4), 772-780.

[10] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.

© iGEM Hamburg 2021