Step 3: A multiple sequence alignment of our 3 sequence sets
We then conducted two multiple sequence alignments (MSAs) with the sequence sets from steps 1 and 2. Each of the two reference sets was aligned together with the set of CPRs and the set of CYPs in an MSA using MAFFT (Multiple Alignment using Fast Fourier Transform). We ran the MAFFT command-line tool with the following command:
mafft --amino --quiet --localpair --maxiterate 100 --distout --treeout --thread 4 [sequences_filepath] > [msa_filepath]
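For scripted runs, the same call can be wrapped in a few lines of Python. The following is only a minimal sketch, not the script used in this project: the function names build_mafft_command and run_mafft and the file paths are hypothetical, and it assumes that the mafft executable is available on the PATH.

```python
import subprocess
from pathlib import Path


def build_mafft_command(sequences_path, threads=4, max_iterate=100):
    """Return the MAFFT argument list matching the command shown above.

    Hypothetical helper: only the flags are taken from the text.
    """
    return [
        "mafft",
        "--amino",                         # treat input as protein sequences
        "--quiet",                         # suppress progress output
        "--localpair",                     # local pairwise alignment
        "--maxiterate", str(max_iterate),  # iterative refinement cycles
        "--distout",                       # also write the distance matrix
        "--treeout",                       # also write the guide tree
        "--thread", str(threads),
        str(sequences_path),
    ]


def run_mafft(sequences_path, msa_path, threads=4):
    """Run MAFFT and redirect its standard output (the alignment) to msa_path."""
    command = build_mafft_command(sequences_path, threads=threads)
    with Path(msa_path).open("w") as msa_file:
        subprocess.run(command, stdout=msa_file, check=True)
```

A call such as run_mafft("nelson_set.fasta", "nelson_set_msa.fasta") would then produce one alignment file per reference set; both file names are placeholders.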
Running MAFFT on the Nelson set resulted in the following alignment:
Figure 2: Multiple sequence alignment of 100 CYP sequences (similar to CYP1A1), 100 CPR sequences (similar to rat-CPR) and 63 known reference fusion protein sequences from: https://drnelson.uthsc.edu/p450seqs-dbs/ [7]
The UniProt set yields a similar picture:
Figure 3: Multiple sequence alignment of 100 CYP sequences (similar to CYP1A1), 100 CPR sequences (similar to rat-CPR) and 34 known reference fusion protein sequences from the UniProtKB: https://www.uniprot.org [8]
Both images show many small gaps within the alignment. On closer inspection, most of these gaps are caused by only a few of the reference sequences. After removing these outliers, the MSAs look considerably cleaner:
Figure 4: Multiple sequence alignment of 100 CYP sequences (similar to CYP1A1), 100 CPR sequences (similar to rat-CPR) and known reference fusion protein sequences from: https://drnelson.uthsc.edu/p450seqs-dbs/ [7] where the outliers in the reference sequences that caused large gaps in the alignment were removed
Figure 5: Multiple sequence alignment of 100 CYP sequences (similar to CYP1A1), 100 CPR sequences (similar to rat-CPR) and known reference fusion protein sequences from the UniProtKB [8] where the outliers in the reference sequences that caused large gaps in the alignment were removed
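The outliers described above were found by inspecting the alignment. One possible way to flag such sequences programmatically is sketched below. This is an illustration under stated assumptions, not the procedure used here: the function count_sparse_columns, the max_occupancy parameter and the toy alignment are all hypothetical. The idea is to count, for each sequence, the alignment columns in which it is one of very few sequences that carry a residue, since such columns appear as gaps in almost every other sequence.

```python
def count_sparse_columns(alignment, max_occupancy=1):
    """Count, per sequence, the columns it occupies that are nearly empty.

    alignment maps sequence names to aligned strings of equal length,
    with "-" as the gap character. A column is "sparse" if at most
    max_occupancy sequences have a residue in it.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    counts = {name: 0 for name in names}
    for column in range(length):
        occupants = [name for name in names if alignment[name][column] != "-"]
        if 0 < len(occupants) <= max_occupancy:
            for name in occupants:
                counts[name] += 1
    return counts


# Toy alignment (made-up sequences): ref_b carries an insertion
# that forces a two-column gap into every other sequence.
toy_alignment = {
    "ref_a": "MKT--LLA",
    "ref_b": "MKTGGLLA",
    "cyp_1": "MKT--LL-",
    "cyp_2": "MRT--LLA",
}
sparse_counts = count_sparse_columns(toy_alignment, max_occupancy=1)
```

In this toy case only ref_b receives a non-zero count, so it would be the candidate for removal. On real data, a threshold for the count would still have to be chosen by hand.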
Another observation is that the sequences within each of the three sets (CPRs, CYPs and reference sequences) align well with each other. Furthermore, the CYPs mainly align with the front part of the reference sequences, while the CPRs align with the back part. This strongly indicates that our reference sets do indeed contain fusion proteins of CYPs and CPRs. In the middle part of the reference sequences, we find a region that aligns with neither the CYPs nor the CPRs. We conclude that this region is the linker of the reference sequences.
However, the regions covered by the CYPs and the CPRs are not fully disjoint, despite the clear cut in the alignment at the linker region. Instead, short suffixes of some CYPs align in the region of the CPRs, and short prefixes of some CPRs align in the region of the CYPs. We conclude that these suffixes and prefixes correspond to structurally nonessential parts of the proteins that are neither present in a fusion protein nor important for its function, and that they should therefore be removed before the two sequences are joined in the laboratory.
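The linker region was identified here by visual inspection of the alignment. The same reasoning could be expressed as a small sketch that searches for the longest stretch of alignment columns in which a reference sequence has residues while both the CYPs and the CPRs are mostly gapped. The function names, the 0.5 occupancy threshold and the toy sequences are assumptions made for illustration only. As a simplification, a gap in the reference sequence ends a stretch.

```python
def group_occupancy(sequences, column):
    """Fraction of sequences in a group with a residue (not a gap) in a column."""
    return sum(seq[column] != "-" for seq in sequences) / len(sequences)


def find_linker_columns(reference, cyps, cprs, threshold=0.5):
    """Return (start, end), inclusive, of the longest run of columns where
    the reference has residues and both groups are below the threshold.

    Returns None if no such column exists.
    """
    best = None
    run_start = None
    # Iterate one position past the end so that a trailing run is closed.
    for column in range(len(reference) + 1):
        in_linker = (
            column < len(reference)
            and reference[column] != "-"
            and group_occupancy(cyps, column) < threshold
            and group_occupancy(cprs, column) < threshold
        )
        if in_linker and run_start is None:
            run_start = column
        elif not in_linker and run_start is not None:
            run = (run_start, column - 1)
            if best is None or run[1] - run[0] > best[1] - best[0]:
                best = run
            run_start = None
    return best


# Toy aligned sequences (made up): CYP part, 6-residue linker, CPR part.
reference = "MKTLLA" + "GSGSGS" + "PEQVRW"
cyps = ["MKTLLA" + "-" * 12, "MRTLLA" + "-" * 12]
cprs = ["-" * 12 + "PEQVRW", "-" * 12 + "PDQVRW"]
linker = find_linker_columns(reference, cyps, cprs)
```

For the toy input, the linker spans alignment columns 6 to 11 (0-based). The overhanging CYP suffixes and CPR prefixes mentioned above would show up as columns where a few sequences of one group extend into the region of the other group; the occupancy threshold keeps such sparse overhangs from shifting the detected boundaries.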