Prediction of linker sequences with multiple sequence alignments

The main goal of our project is to optimize electron transfer in enzymatic redox reactions. We do this by fusing members of the electron transport chain, more precisely by linking a cytochrome P450 enzyme (CYP) to a cytochrome P450 reductase (CPR), creating a self-sufficient fusion protein.

Similar fusion complexes exist in nature, namely in prokaryotes and fungi [1][2], and it has already been shown that new fusion complexes with similar properties can be engineered artificially, with the opportunity to modify their functionality as needed [3][4][5].

When designing fusion proteins, several parameters have to be considered. One of them is the selection of a suitable linker between the two domains. Direct fusion of two domains can lead to misfolding and aggregation, which impairs their bioactivity. Many linkers have been developed based on the knowledge of natural linkers in multi-domain proteins [6]. Since the linkers, and especially their length, play a crucial role in the functionality of fusion proteins, our goal was to predict possible linker sequences for our CYP-CPR proteins. In essence, we did this by aligning our two separate protein sequences to a library of known fusion proteins.

Figure 1: Workflow of the linker prediction

Step 1: Finding a set of known fusion proteins

The first step was to find a number of sequences of known fusion proteins. For this we took two approaches. In the first, we used the large collection of P450 enzyme sequences provided by Dr. D. Nelson from the University of Tennessee Health Science Center [7]. From this collection we selected all sequences longer than 800 amino acids, assuming that these are fusions of two P450-related proteins. This set consisted of 113 sequences (see supplement).
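The length filter described above can be sketched in a few lines of Python. This is only an illustration under the assumption that the collection is available in FASTA format; the function names and the toy records are hypothetical, and the text does not state which tool was actually used for this step.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines of one record may be wrapped over several lines.
            records[header] += line
    return records


def filter_by_length(records: Dict[str, str], min_length: int = 800) -> Dict[str, str]:
    """Keep only sequences strictly longer than min_length residues."""
    return {h: s for h, s in records.items() if len(s) > min_length}


# Toy input with hypothetical names; real input would be the downloaded collection.
example = [
    ">short_P450",
    "M" * 500,
    ">putative_fusion",
    "M" * 600,
    "A" * 450,
]
candidates = filter_by_length(parse_fasta(example))
print(sorted(candidates))
```

With a real file, the lines could be read via `open(path)` and passed to `parse_fasta` directly.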

In a second approach we used the following search query in the UniProt database:

“gene:cyp length:[800 TO *] name:bifunctional”

This resulted in a set of 34 sequences (see supplement).

Using each of these two reference sets as subject sequences, we conducted a BLAST search [9] with our CYP1A1 sequence and the rat CPR sequence as queries. The results were two sets of known sequences (see supplement) that are similar to the sequences of the proteins we want to fuse. We refer to these reference sets as Nelson (63 sequences) and UniProt (34 sequences), respectively.
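As an illustration of how the hits of such a search can be collected, the following sketch reads BLAST results in the tabular format (BLAST+ option -outfmt 6 with default columns) and keeps the subject sequences below an E-value cutoff. Both the output format and the cutoff of 1e-5 are assumptions made for this example; the text does not state which settings were used, and the toy rows are invented.

```python
from typing import Iterable, Set

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def hit_ids(lines: Iterable[str], max_evalue: float = 1e-5) -> Set[str]:
    """Return the IDs of all subject sequences with a hit at or below max_evalue."""
    hits = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        fields = dict(zip(COLUMNS, line.split("\t")))
        if float(fields["evalue"]) <= max_evalue:
            hits.add(fields["sseqid"])
    return hits


# Toy rows (hypothetical IDs and values): ref_A is hit by both queries,
# ref_B only by a weak, short match.
toy_rows = [
    "CYP1A1\tref_A\t35.2\t480\t290\t10\t20\t495\t15\t490\t2e-50\t180",
    "CPR_rat\tref_A\t40.1\t600\t340\t12\t60\t650\t520\t1110\t1e-90\t300",
    "CYP1A1\tref_B\t22.0\t60\t45\t1\t100\t160\t30\t90\t0.5\t25",
]
selected = hit_ids(toy_rows)
print(selected)
```

Using a set means that a reference sequence hit by both the CYP and the CPR query is counted only once.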

Step 2: Generating sequence profiles for our CYP and CPR sequences

Next, we ran a BLAST search [9] with default settings against the database of non-redundant protein sequences (nr) using our two query sequences: the CYP1A1 sequence and the rat CPR sequence. This gave us a set of similar protein sequences for each query (CYPs and CPRs).

Step 3: A multiple sequence alignment of our three sequence sets

We then computed two multiple sequence alignments (MSAs) from the sequence sets of steps 1 and 2. Each of the two reference sets was aligned together with the set of CYPs and the set of CPRs in one MSA using MAFFT (Multiple Alignment using Fast Fourier Transform). We ran the MAFFT command line tool with the following command:

mafft --amino --quiet --localpair --maxiterate 100 --distout --treeout --thread 4 [sequences_filepath] > [msa_filepath]
This resulted in the following alignment for the Nelson set:

Figure 2: Multiple sequence alignment of 100 CYP sequences (similar to CYP1A1), 100 CPR sequences (similar to rat-CPR) and 63 known reference fusion protein sequences from [7]
The UniProt set gives a similar picture:

Figure 3: Multiple sequence alignment of 100 CYP sequences (similar to CYP1A1), 100 CPR sequences (similar to rat-CPR) and 34 known reference fusion protein sequences from the UniProtKB [8]
The images show many small gaps within the alignment. On closer inspection, these gaps are mainly caused by very few of the reference sequences. After removing these outliers, the MSAs look a lot cleaner:

Figure 4: Multiple sequence alignment of 100 CYP sequences (similar to CYP1A1), 100 CPR sequences (similar to rat-CPR) and known reference fusion protein sequences from [7], after removing the outliers in the reference sequences that caused large gaps in the alignment
Figure 5: Multiple sequence alignment of 100 CYP sequences (similar to CYP1A1), 100 CPR sequences (similar to rat-CPR) and known reference fusion protein sequences from the UniProtKB [8], after removing the outliers in the reference sequences that caused large gaps in the alignment
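One possible way to find such gap-causing outliers automatically is sketched below. It is not necessarily how the outliers were identified here (the text only mentions closer inspection of the alignment). The heuristic marks alignment columns in which only a small fraction of sequences carries a residue and flags every sequence that places several residues in such columns. The thresholds, function names and toy alignment are assumptions for illustration.

```python
from typing import Dict, List


def sparse_columns(alignment: Dict[str, str], max_occupancy: float = 0.1) -> List[int]:
    """Indices of columns where at most max_occupancy of the rows have a residue."""
    rows = list(alignment.values())
    n_rows = len(rows)
    width = len(rows[0])
    sparse = []
    for col in range(width):
        occupied = sum(1 for row in rows if row[col] != "-")
        if occupied / n_rows <= max_occupancy:
            sparse.append(col)
    return sparse


def gap_causing_sequences(alignment: Dict[str, str],
                          max_occupancy: float = 0.1,
                          min_residues: int = 3) -> List[str]:
    """Names of sequences with at least min_residues residues in sparse columns."""
    sparse = sparse_columns(alignment, max_occupancy)
    flagged = []
    for name, row in alignment.items():
        n_in_sparse = sum(1 for col in sparse if row[col] != "-")
        if n_in_sparse >= min_residues:
            flagged.append(name)
    return flagged


# Toy alignment (hypothetical): ref_2 carries an insertion that forces
# four gap columns into all other rows.
toy_alignment = {
    "cyp_1": "MKT----LLV",
    "cyp_2": "MRT----LIV",
    "ref_1": "MKT----LLV",
    "ref_2": "MKTGGSGLLV",
}
outliers = gap_causing_sequences(toy_alignment, max_occupancy=0.25)
print(outliers)
```

After dropping the flagged sequences, the remaining sequences would have to be re-aligned (or the now empty columns removed) to obtain the cleaner picture.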
Another observation is that the sequences within each of the three sets (CPRs, CYPs and reference sequences) align well with each other. Furthermore, the CYPs mainly align with the front part of the reference sequences, while the CPRs align with the back part. This is a clear indicator that our reference sets indeed contain fusion proteins of CYPs and CPRs. In the middle part of the reference sequences we find a region that aligns with neither the CYPs nor the CPRs. We conclude that this region is the linker region of the reference sequences. However, the regions covered by the CYPs and by the CPRs are not fully disjoint, despite the clear cut at the linker region. Instead, short suffixes of some CYPs align in the region of the CPRs, and short prefixes of some CPRs align in the region of the CYPs. We conclude that these suffixes and prefixes correspond to structurally non-essential parts of the proteins that are neither present in a fusion protein nor important for its function, and that they should therefore be removed before joining the two sequences in the laboratory.
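The reasoning above can be turned into a small sketch that locates the linker region in an MSA. It assumes the aligned rows are already split into CYPs, CPRs and reference sequences, and it uses a per-column occupancy threshold of 50% so that the short stray suffixes and prefixes do not shift the domain boundaries. The threshold, the function names and the toy alignment are illustrative assumptions and not the procedure actually used, which is not specified in detail in the text.

```python
from typing import Dict, List, Tuple


def occupancy(rows: List[str]) -> List[float]:
    """Fraction of rows that have a residue (not a gap) in each column."""
    n = len(rows)
    return [sum(row[c] != "-" for row in rows) / n for c in range(len(rows[0]))]


def linker_window(cyp_rows: List[str], cpr_rows: List[str],
                  min_occupancy: float = 0.5) -> Tuple[int, int]:
    """Half-open column range between the end of the CYP block and the start of the CPR block."""
    cyp_occ = occupancy(cyp_rows)
    cpr_occ = occupancy(cpr_rows)
    cyp_end = max(c for c, f in enumerate(cyp_occ) if f >= min_occupancy)
    cpr_start = min(c for c, f in enumerate(cpr_occ) if f >= min_occupancy)
    return cyp_end + 1, cpr_start


def extract_linkers(ref_rows: Dict[str, str], window: Tuple[int, int]) -> Dict[str, str]:
    """Ungapped residues of every reference sequence inside the linker window."""
    start, stop = window
    return {name: row[start:stop].replace("-", "") for name, row in ref_rows.items()}


# Toy alignment (hypothetical). One CYP has a stray residue in the CPR
# region and one CPR has a stray residue in the CYP region.
cyps = ["MKTLLV----------", "MRTLIV----------", "MKTLLV------A---"]
cprs = ["----------AEGFSK", "----------AEGYSK", "---M------AEGFSK"]
refs = {"ref_1": "MKTLLVGSGSAEGFSK", "ref_2": "MRTLIVPA--AEGYSK"}

window = linker_window(cyps, cprs)
linkers = extract_linkers(refs, window)
print(window, linkers)
```

The same window also indicates where the CYP and CPR query sequences would be trimmed: residues of a CYP that fall after the window, or of a CPR that fall before it, are the stray suffixes and prefixes mentioned above.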

Step 4: Choosing a linker sequence for our fusion protein

We then selected some linker sequences for experimental testing. In principle, every sequence in the linker region of our MSA that begins and ends at the clear cut positions is an example of a linker found in a known CYP-CPR fusion protein. For our selection we used the following criteria:

- similarity to our CYP and CPR sequences

- protein family (since we had distinct CYP-families in our reference sequences)

- annotation score of the corresponding reference sequence

This resulted in 6 linker sequences and respective trimmed CYP and CPR sequences for experimental testing in the laboratory:

The sequence sets used can be downloaded here: Supplement


[1] Hong, M., Kim, R. N., Song, J. Y., Choi, S. J., Oh, E., Lira, M. E., ... & Choi, Y. L. (2014). HIP1–ALK, a novel fusion protein identified in lung adenocarcinoma. Journal of Thoracic Oncology, 9(3), 419-422. doi: 10.1097/JTO.0000000000000061

[2] Hoffmann, I., & Oliw, E. H. (2013). Discovery of a linoleate 9S-dioxygenase and an allene oxide synthase in a fusion protein of Fusarium oxysporum. Journal of lipid research, 54(12), 3471-3480. doi: 10.1194/jlr.M044347

[3] Valiyari, S., Salimi, M., & Bouzari, S. (2020). Novel fusion protein NGR-sIL-24 for targetedly suppressing cancer cell growth via apoptosis. Cell biology and toxicology, 36(2), 179-193. doi: 10.1007/s10565-020-09519-3

[4] Yu, K., Liu, C., Kim, B. G., & Lee, D. Y. (2015). Synthetic fusion protein design and applications. Biotechnology advances, 33(1), 155-164. doi: 10.1016/j.biotechadv.2014.11.005

[5] Sadeghi, S. J., & Gilardi, G. (2013). Chimeric P450 enzymes: Activity of artificial redox fusions driven by different reductases for biotechnological applications. Biotechnology and Applied Biochemistry, 60(1), 102-110. doi: 10.1002/bab.1086

[6] Shamriz, S., Ofoghi, H., & Moazami, N. (2016). Effect of linker length and residues on the structure and stability of a fusion protein with malaria vaccine application. Computers in biology and medicine, 76, 24-29. doi: 10.1016/j.compbiomed.2016.06.015

[7] Nelson, DR (2009) The Cytochrome P450 Homepage. Human Genomics 4, 59-65. url:

[8] UniProt Consortium. (2019). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506-D515. url:

[9] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.

© iGEM Hamburg 2021