Step 3: A multiple sequence alignment of our 3 sequence sets
We then conducted two multiple sequence alignments (MSAs) with the sequence sets from steps 1 and 2. Each of the two reference sets was aligned together with the set of CPRs and the set of CYPs using MAFFT (Multiple Alignment using Fast Fourier Transform). We ran the MAFFT command-line tool with the following command:
mafft --amino --quiet --localpair --maxiterate 100 --distout --treeout --thread 4 [sequences_filepath] > [msa_filepath]
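For scripted runs, the same call can be wrapped in Python. This is a minimal sketch, assuming that the mafft executable is on the PATH; the function names build_mafft_command and run_mafft are hypothetical helpers introduced here for illustration and are not part of MAFFT.

```python
import subprocess
from pathlib import Path


def build_mafft_command(sequences_filepath, iterations=100, threads=4):
    """Return the MAFFT argument list matching the command shown above."""
    return [
        "mafft",
        "--amino",                        # treat the input as protein sequences
        "--quiet",                        # suppress progress messages
        "--localpair",                    # local pairwise alignments
        "--maxiterate", str(iterations),  # iterative refinement cycles
        "--distout",                      # also write the pairwise distances
        "--treeout",                      # also write the guide tree
        "--thread", str(threads),
        str(sequences_filepath),
    ]


def run_mafft(sequences_filepath, msa_filepath, **kwargs):
    """Run MAFFT and redirect its standard output (the MSA) to msa_filepath."""
    command = build_mafft_command(sequences_filepath, **kwargs)
    with open(msa_filepath, "w") as msa_file:
        # check=True raises CalledProcessError if MAFFT exits with an error
        subprocess.run(command, stdout=msa_file, check=True)
    return Path(msa_filepath)
```

Building the argument list separately from running it makes it possible to inspect the exact command before starting a long alignment.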
This resulted in the following alignment for the Nelson set:
And a similar picture for the UniProt set:
The images show many small gaps within the alignment. On closer inspection, these gaps are mainly caused by only a few of the reference sequences. After removing these outliers, the MSAs look much cleaner:
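The text does not state how the outliers were identified, so the following is only one possible approach, sketched under assumptions: the alignment is available as aligned FASTA, gaps are written as "-", and a column counts as "sparse" when at most a chosen fraction of sequences has a residue in it. The function names and the max_occupancy threshold are hypothetical.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a dict of {identifier: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier up to the first space
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def gap_causing_sequences(alignment, max_occupancy=0.1):
    """Count, per sequence, how many sparse columns it occupies.

    A sparse column is one where only a small fraction of the sequences
    (at most max_occupancy) has a residue, so every other sequence is
    forced to carry a gap there.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    counts = Counter()
    for col in range(length):
        occupied = [n for n in names if alignment[n][col] != "-"]
        if 0 < len(occupied) <= max_occupancy * len(names):
            counts.update(occupied)
    return counts
```

Calling gap_causing_sequences(alignment).most_common(5) lists the sequences responsible for the most sparse columns, which are candidates for manual inspection before removal.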
Another observation is that the sequences within each of the three distinct sets (CPRs, CYPs and reference sequences) align well with each other. Furthermore, the CYPs mainly align with the front part of the reference sequences, while the CPRs align with the back part. This strongly indicates that our reference sets indeed contain fusion proteins of CYPs and CPRs. In the middle part of the reference sequences we find a region that aligns with neither the CYPs nor the CPRs. We conclude that this region is the linker of the reference sequences. However, the regions covered by the CYPs and the CPRs are not fully disjoint, despite the clear cut in the alignment at the linker region. Instead, a set of short suffixes of the CYPs aligns in the region of the CPRs, and a set of short prefixes of the CPRs aligns in the region of the CYPs. In these cases, we conclude that the suffixes and prefixes correspond to structurally nonessential parts of the proteins that are neither present in nor important for the function of a fusion protein, and that they should therefore be removed before joining the two sequences in the laboratory.
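The reasoning above can be sketched in code. This is an illustrative sketch rather than the procedure actually used: it assumes the alignment is held as a dict of equal-length gapped strings, and it treats a column as belonging to the CYP or CPR region when at least a chosen fraction (the hypothetical min_occupancy parameter) of that set has a residue there. Stray suffixes or prefixes from a few sequences fall below that fraction and therefore do not shift the region boundaries.

```python
def occupancy(alignment, names):
    """Fraction of the given sequences with a residue in each column."""
    length = len(alignment[names[0]])
    return [
        sum(alignment[n][col] != "-" for n in names) / len(names)
        for col in range(length)
    ]


def linker_columns(alignment, cyp_names, cpr_names, min_occupancy=0.5):
    """Return (start, end) alignment columns, end exclusive, lying between
    the last CYP-dense column and the first CPR-dense column, or None if
    no such stretch exists."""
    cyp_occ = occupancy(alignment, cyp_names)
    cpr_occ = occupancy(alignment, cpr_names)
    cyp_dense = [c for c, f in enumerate(cyp_occ) if f >= min_occupancy]
    cpr_dense = [c for c, f in enumerate(cpr_occ) if f >= min_occupancy]
    if not cyp_dense or not cpr_dense:
        return None
    start, end = max(cyp_dense) + 1, min(cpr_dense)
    return (start, end) if start < end else None


def ungapped_slice(aligned_sequence, start, end):
    """Residues of one aligned sequence in columns [start, end), gaps removed."""
    return aligned_sequence[start:end].replace("-", "")
```

Applying ungapped_slice to a reference sequence with the columns returned by linker_columns gives a candidate linker for that reference. The same occupancy lists can be used to flag CYP residues that fall after the linker or CPR residues that fall before it as candidates for trimming.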