# SOFTWARE

### Introduction

Our dry lab team decided to contribute to both parts of “AmeBye” project. The main software was realized for diagnostics part - we implemented aptamer generating software by using entropy fragment based approach (EFBA) and extended it with a transformer neural network model (TEA).

For the naringenin synthesis part, in the process of a computational fusion protein model development we found several small tools for bioinformatics to be useful, which we decided to present in the software page along with the main software.

All these software tools can be found in our repository.

### EFBALite

#### Motivation

Creating aptamers in the laboratory is an expensive and time consuming process. Thus we were inspired by [16] to create software that would help with this process. The result of our efforts is EFBALite - an implementation (with several improvements and additions) of the Entropy Fragment-Based Approach (EFBA) in the Python programming language that is able to generate an aptamer sequence from scratch.

The program and user guide is available at the GitHub repository.

#### Description and theory

As the name suggests, the algorithm used by EFBALite is based on a statistical measure called relative entropy (or Kullback–Leibler divergence), thus to begin we need to construct a probability space. We use statistical mechanics to accomplish this goal.

Let $$\Omega$$ denote the phase space of our aptamer target complex. The PDF (point density function) of $$\Omega$$ (considered as a random variable) is given by $$f_{\Omega} (Q) = \frac{e^{-\beta H(Q)}}{Z},$$ where $$H$$ is the Hamiltonian, $$\beta$$ is a constant (which depends on temperature) and $$Z = \int_{\mathbb{R} ^ n} e^{-\beta H(Q)} dQ$$ is the partition function.

We approach the problem of finding the suitable aptamer for our target using a "lockpicking" approach. That is we try to generate the aptamer one nucleotide at the time, instead of trying to guess the whole sequence at once. This approach allows us to avoid a combinatorial explosion for longer sequences.

Suppose we are trying to fit an aptamer of length one (that is a single nucleotide) to our target. The goal of our approach is to find a nucloetide that binds the best when compared to the binding of a random nucleotide. In our approach we measure this by seeing how random the distribution of $$\Omega$$ is with a specific nucleotide instead of a random one. Alternatively, we can formulate this as "how much information do we gain about $$\Omega$$ if we use this specific nucleotide instead of a random one?". We then of course pick the nucleotide with the biggest information gain.

The information gain can be quantified as relative entropy. But first, more notation, let $$\Omega^R$$ denote the phase phase of the complex with a random nucleotide and $$\Omega^N$$ the phase phase of the complex with a particular nucleotide, where $$N$$ can be equal to A, C, G, T or U (depending on your flavor of aptamer). Then the relative entropy $$D_{KL}$$ from $$\Omega^R$$ to $$\Omega^N$$ is equal to $$D_{KL} = \int_{\mathbb{R}^n} f_{\Omega^N}(Q) \log\left(\frac{f_{\Omega^N}(Q)}{f_{\Omega^R}(Q)}\right) dQ,$$ note that $$D_{KL}$$ is not symmetric in terms of $$\Omega^R$$ and $$\Omega^N$$.

The $$f_{\Omega^R}(Q)$$ can be calculated as $$f_{\Omega^R}(Q) = \sum_{n} P(N = n)f_{\Omega^n}(Q).$$ We took $$P(N = n)$$ to be equal to $$1/4$$, but alternative weights can be used if one knows the prior distribution of nucleotides in the aptamer.

We find the subsequent nucleotides in the aptamer by fixing the sequence that we have already found and then computing the relative entropies for the next nucleotide.

For the practical implementation of the algorithm we construct a discrete phase space. For the first nucleotide, we sample points on the surface of the target which are approximately $$x$$ angstroms apart ($$x$$ can be specified as desired) and additionally we rotate the nucleotide by 0 and 180 degrees along all Euler angles, where we consider the O3 atom of the nucleotide to be the zero point of the Euclidean space. For the subsequent nucleotides, we keep the already determined sequence fixed in space and rotate the next nucleotide by -90, -60, -30, 0, 30, 60 and 90 degrees along all Euler angles by considering the phosphate atom as the zero point of the Euclidean space.

Fig. 1. The O3 and P atoms between two guanine nucleotides

The "lockpicking" approach does not find the optimal sequence in the space of all sequences in terms of information gain. To get as close to the global maximum we implement a searching procedure by specifying an information gain threshold $$T$$ which each nucleotide in the sequence must reach. This goes as follows. For the first nucleotide take the one with the highest relative entropy, fix it to the phase space position that is most probable. Now compute the second nucleotide, if the maximum relative entropy is higher than $$T$$ keep the nucleotide, fix it to the most probable position and continue, if not backtrack to the first nucleotide and check the second most probable position. Continue this until you find a sequence of desired length or exhaust all possible sampling positions.

#### Implementation

The program is written using the Python3 programming language.

To compute the energy of the aptamer target complex we use the OpenMM package and use the OpenMM's implementation of the AMBER14 forcefield.

The program finishes generating an aptamer in reasonable time (overnight) on a personal computer with a GPU.

#### Validation

The testing of the software consisted of two parts: computational and experimental validation. Computational modeling of aptamer and protein binding was performed using docking workflow (more detailed explanation can be found in the bullet point list below).

The computational modeling flow that we decided to apply to our case is made of:

• Determining the two-dimensional structure of the aptamer (Mfold [17])
• In case of several outputs of Mfold, assumption was made that the secondary structure of an aptamer is the one that has the smallest free energy change.
• Determining the three-dimensional structure of the aptamer (RNAComposer [18])
• Assumption was made that the aptamer folds into the 3D shape that was determined by RNAComposer program.
• Nonetheless, this program is designed for RNA 3D structure determination, according to the molecular docking review article [21], there were studies [22, 23, 24] performed, which showed that atomic composition of DNA and RNA is highly similar, thus the programs that generate the 3D structure of RNA can be used for DNA as well.
• Modelling the target (protein) if its structure was not determined experimentally
• Assumption was made that the target protein has the structure that was modeled with a modeling program. We opted for AlphaFold2 [19]. The detailed explanation about modeling choices can be found on Fusion Protein Modeling page.
• Docking the 3D aptamer on the modeled target (HDOCK [20])
• Since it is impossible to exhaust all possible conformations of aptamer - protein complex, docking programs include sampling methods that extract a subset of probable conformations and chooses the best one among them. HDOCK, precisely, performs global search using Fast Fourier transform based search algorithm.

Experimental validation was also performed. We reached out to TU Delft, the iGEM team that was also working with aptamers, for help with experimental validation (the whole story about our partnership can be found on our Partnership page ). We generated aptamer sequences for their protein of interest RBP4 (Table 2) and they performed electrophoretic mobility shift assay (EMSA) to evaluate the affinity between RBP4 and the generated aptamers.

Table 1. EFBA generated aptamers for albumin and EhPPDK with controls
Aptamer Name Sequence Mfold score HDOCK score
Original aptamer from EFBALite (25 nt) ALB.O GGATATAATTGGTTTTTTCTTGTGT 0.95 -318.64
“Better than random” aptamer ALB.BTR GTGCACGTGTATTTATTCGTTTTTT - -291.23
Handmade aptamer based on EFBALite (83 nt) ALB.H GCGCGCGGGCGATATATATATATATCGCCGGATATAATTG
GTTTTTTCTTGTGTCGCCTATATATATATATAGGCGCGCG
CGC
-20.59 -358.76
Random 69 nt aptamer ALB.69 GACGTCGCCCTGAAAAAAAGATTTCTGCAACTCTCCTCGT
CAGCAGTCTGGTGTATCGAAAGTACAGGA
- -450.83
Original aptamer from EFBALite  EhPPDK. O GAGTTTAGGGGTCTTTGTTTTGGTGTT 0.47 -394
Random loop introduced for stabilization EhPPDK.R GAGTTTAGGGGTCTTTGTTTTGGTGTTAGAAACACCAAAA
CAAAGGTTAGACGCTACCCCCTAAACTC
-23.46 -318.27
“Loop” shape aptamer EhPPDK.L GAGTTTAGGGGTCTTTGTTTTGGTGTTGAGTTTAGGGGTC
TTTGTAACACCAAA
-5.43 -398.39
Table 2. Scores of the final set of single-stranded DNA and RNA sequences for RBP4
Aptamer Sequence Mfold score HDOCK score
RBP4_30 GTTGATTGTTATGTTTAGTGACGGGTTCCC 0.78 -363.45
random_30 AGGGTCACATGGGCGTTTGGCACTACCGAC -1.22 -356.26
RBP4_RNA_30 GUCCCCCGCCCGUGUCCCGCUAGCCCCGCG -1.6 -376.82
random_RNA_30_1 CUGUUUUCGAAAUUACCCUUUAAGCGCGGG -2.20 -305.81

#### Conclusions

Unfortunately, according to the results that TU Delft team received, neither of the sequences were bound to RBP4 molecule.

Additionally, the sequences generated for albumin and EhPPDK with EFBA software were tested with an aptamer-based Western blot. In the result, the generated sequences did not show any binding signal.

There were sequences for three target proteins tested. In order to make more confident conclusions, more protein target cases should be analyzed. Although, the results that we received might lead to the conclusion that EFBA approach on its own is not capable to generate affine aptamer sequences.

### EFBAScore

#### Motivation

The Entropic-Fragment Based approach can be used to create a function that would score the aptamer's affinity to the target molecule. We implemented this scoring function in our second program - EFBAScore. Admittedly, this scoring function is not very accurate, however the main benefit is that it can be used to perform bulk scoring (to score many sequences with high computational efficiency) which allowed us to generate the starting dataset for Transformer Enhanced Aptamers (TEA) - our third piece of software.

The program and user guide is available at the GitHub repository.

#### Description and theory

The main ideas are the same as EFBALite. However, now we have an input sequence and do not have to compare entropies between different nucleotides. Thus, our final score is just the sum of entropies computed at every step.

To pick the spatial position for the nucleotide we use the position that has the highest probability. In order to compute the entropy of a nucleotide, we use a uniformly distributed random variable to compute the relative entropy at every step instead of $$\Omega^R$$.

#### Implementation

The program is written using the Python3 programming language.

To compute the energy of the aptamer target complex we use the OpenMM package and use the OpenMM's implementation of the AMBER14 forcefield.

### EFBA extention TEA (Transformer Enhanced Aptamers)

#### Abstract

We took a step further and applied a novel transformer-based neural network (NN) model ALBERT which was combined with a genetic algorithm (GA) to make aptamer generation in silico a more resource-efficient process.

Firstly, the algorithm was enhanced with a Bayesian probabilistic model to define a finite GA iteration number that helps users to determine when a list contains a proper number of fit aptamers. Next, by employing Markov chain Monte Carlo (MCMC) methodology we were able to analyse the NN model error rate which gave insights about the probability of throwing away the fit aptamer from a final list and the space for future improvements. Finally, the Pytorch NN model was rewritten to the ONNX framework which sped up the algorithm more than 3 times, and overall more than 300 compared to EFBALite. Worth mentioning, model has a key property of transfer learning - anyone who has a similar task can fine-tune it on different target proteins and use it instantly to generate fit aptamers. Additionally, the model is accessible in an AI community framework Hugging Face.

### Dataflow

Initially, N (1500 in our case) random aptamer sequences are generated by using EFBA. Following it up, the data must be specifically preprocessed to contain a pair of aptamers with a binary label that determines if the second sequence is more fit (1) or not (0).

Table 3. Example of sequence labeling
Sequence1 Sequence2 Label
TGATCACGCAGGCAT GAGACCTTTCGTGTA 0
CTAGACCTTATAGAC AACAATCATAAGGCG 1
CTTGCGGTCATACAC TACCCGGGCTTTCGA 0
AATTGGCCCAAATTG GTTTGGAGGCTGATC 0
CGTGCCGCCATCGGC TCATCCCCACAAAGC 1

The dataset of paired sequences is obtained by comparing every aptamer by fitness score which is computed with the former software. Afterwards, the number of labels for classification is balanced by flipping aptamers places for the model to learn both classes equally.

Many transformer-based models could fit this task, however ALBERT model was picked because of its state-of-the-art performance with fewer parameters than the threshold BERT model. ALBERT takes 4-5 times less time to make an inference, saving days of expensive GPU runtime. Another significant part of the model is the genetic algorithm that produces new sequences at every iteration by breeding and mutation steps; those steps ensure we are generating new aptamers with similar “good” properties. Lastly, the sequences of the final iteration are analyzed and compared by EFBA, furthermore the best of it, 5 percent of the total, should be evaluated in the lab.

Additionally, Bayesian probabilistic model was employed to ensure NN convergence and efficiency. The prior distribution was drawn from our beliefs about the aptamers population. Then it was “updated” by generated data (the likelihood function) to get the posterior distribution which can be used to make inferences.

• If more data is observed, the posterior distribution has a smaller variance, which leads to the more precise inference.
• There is no “perfect” prior to choose from, you should rely on some researches or intuition.

Model output gives us more information of GA iteration on average, in this case, that every 1000 aptamers will have at least 3 fit sequences and that there is 85 percent of chance that we will obtain y ∈ {1,2,3,4,5,6} fit aptamers.

Fig. 2. Prior, likelihood and posterior distributions of aptamers for target protein albumin
Fig. 3. Density function of possible number of fit aptamers in 1000 random generated sequences

The next modeling step was to evaluate the risk of removing fit aptamer from the list of the best aptamers. Current TEA model for albumin has a score of 84.6 percent accuracy, this indicates that on average it is a state-of-the-art classifier, however if we have to classify similar aptamers or there is some chaotic behaviour, we might lose an aptamer or two in a GA iteration. MCMC simulation helped us to see a few possible scenarios of chaotic GA iterations (outliers) and how to possibly avoid them - a simple solution is to keep a longer list of the most fit aptamers and frame them aptamers so they cannot be considered as fit in one iteration and not fit in the next.

### Model training and results

ALBERT models for albumin have shown the state-of-the-art performance which can be seen in Fig. 4 and Fig. 6. The large model has shown around 4 percent better performance, however it is twice as slow as the base alternative so we have chosen ALBERT-base for a further inference. Using transfer learning property of transformers we were able to fine-tune the model again for the target protein EhPPDK and it has shown similar results. Fine-tuning took only around 3 hours because NN only needs to retrain its positional embedding for different target protein to work it out, also it was trained on the same size of randomly generated aptamer dataset of 1500 sequences.

Fig. 4. ROC curve for base and large ALBERT models
Fig. 5. ALBERT-base model confusion matrix with descriptive statistics
Fig. 6. ALBERT-large model confusion matrix with descriptive statistics
Table 4. TEA generated aptamers for albumin
Aptamer Score
TCTTGCTGGTATACT 52.55
CTCTTCAGCTTGATC 52.32
TTCTTCAGCTTTACT 52.32
TTCTTCAGCTTTCCT 52.30
TTCTTCAGCTTTTTG 52.28
TTCTTCAGCTACTCT 52.27
TTCTTCAGATCTCTT 52.26
TTCTTCAGCTGCCCC 52.25
TTCTTCAGATCTCCT 52.25
TTCTTCAGCTCCGAT 52.23
Table 5. Generated aptamers for EhPPDK which are not affine to albumin with TEA
Aptamer Albumin Entropy EhPPDK Entropy Difference
ACTTCTCAGGAGCGA 10.73 39.38 28.65
AGTGCAATTGCCTAC 10.73 37.24 26.51
AGTTCGCCATTACTT 10.73 36.44 25.71
AGCGCTCCGGCTTAC 10.73 35.53 24.80
AGCTCAACGATCACG 10.73 35.40 24.67
AGCACGCTGTTATAA 10.73 35.13 24.40
AGCGCGTTATTGGCT 10.73 35.12 24.39
ACCACGTTTTCAGAT 10.73 35.09 24.36
AGCACATTCCAGATG 10.73 35.05 24.32
ATCGCACGCCGATTG 10.94 34.97 24.03

#### Optimization

The PyTorch ALBERT model has been converted to the ONNX framework, which has an accelerated inference speed approximately three times of initial speed, at the end ONNX ALBERT model is capable to compare around 800 aptamers per 3 minutes, which is around 300 times faster than using the former EFBA. We considered a few other possibilities of enhancing the process. For example, diminishing parameters precision from F32 to INT8 provides the faster processing, yet it comes with a decrease in accuracy of the whole model, which is not worth a trade.

#### Usage and documentation

Detailed usage and documentation of the software can be found in the GitHub repository.

#### Conclusions

We evaluated the affinity of the first three TEA generated sequences with aptamer-based Western blot. For each target protein four cases were studied: the first one - a control, for which the protein of interest was taken without any aptamer and other three cases for each of the chosen aptamer from the list.

As mentioned in EFBA section, the EFBALite is not capable to generate affine aptamer sequences, although we used EFBA scoring function (EFBAScore) in the development process of TEA. Therefore, the following statements give more insight about the reliability of EFBA method as a scoring function.

We did not observe any significant signal in the blot with albumin aptamers, although there was a rather bright band at the position of 100 kDa (where EhPPDK should be observed) for the first and second aptamer of EhPPDK, which might indicate binding between the aptamer and the protein. Due to the lack of time, we could not investigate aptamer binding affinity by applying another method. In order to make more confident statements about TEA's ability to produce affine aptamer sequences, more target cases should be studied, different scoring functions and different validation methods should be applied.

To sum up, the observations that we received are promising - the constructed neural network has the potential to produce affine aptamer sequences. Since the field of aptamer-based diagnostics has got a wide field of applications, it might be worth to study the potential of TEA more thoroughly.

### Bioinformatics tools

This section is about the bioinformatics tools that we used for our working flow in silico . We found them helpful in our project development. Thus we decided to describe them here since they might be beneficial for other teams as well. All bioinformatics tools presented in this section are intended to support the small bioinformatics tools manifesto [4].

### GenFusMSA

#### Motivation

After several attempts to model our fusion protein system, we decided to include multiple sequence alignment (MSA) as an additional input into our chosen modeling programs. Both trRosetta and RoseTTAFold generate MSA internally. Since the mentioned programs by default cannot handle our fusion protein cases accurately, we were encouraged to use sequence-search methods to create MSA files for our modeling input ourselves.

#### Description

The presented tool GenFusMSA is a script that generates multiple sequence alignment files that can be used for fusion protein modeling with programs like trRosetta[5], RoseTTAFold[6], and AlphaFold2[7]. By providing our own MSA files, we received more probable structures with less disordered domains that belong to the linked proteins. A more detailed description of our modeling flow can be found on the Wiki page of fusion protein modeling.

The script is a simple program that scans input full query-template .a3m files and pairs sequences according to their taxonomy ID . The paired sequences are joined via a peptide linker that is determined by the user, and the collection of sequences can be saved as an output .a3m file.

#### Usage

From the user side, the script takes in two .a3m files (as parameters -i1 and -i2) which can be generated by the HHblits[3] program. Additionally, the user is asked to set a linker sequence (-l parameter) and how many times (-n parameter) it is repeated. By default, the newly generated .a3m file is printed to the screen, therefore the construction > [output.a3m] directs the result to the specified .a3m file.

perl GenFusMSA.pl -i1 [first fullQT.a3m] -i2 [second fullQT.a3m] -l [linker] -n [linker repeats]

Detailed usage and documentation of the software can be found in the GitHub repository.

### Molecular dynamics scripts

#### Motivation

Conventionally, one of the parts of fusion protein modeling includes molecular dynamics (MD) simulations[16]. Contemporary protein modeling approaches reach high enough accuracy that molecular dynamics simulations for refinement do not improve the modeled structure. However, since our systems include both rigid and flexible linkers, we thought molecular dynamics might be helpful to get the probable conformations to calculate the distance between active sites.

While developing our project, we came up with two types of MD simulations: atomic and coarse-grained (CG). For that, we wrote scripts in the Perl programming language. These scripts are written to run GROMACS[8] - a molecular dynamics package to simulate proteins, nucleic acids, and lipids.

#### Atomic

The atomic simulations take into account each atom in the system. These simulations are more accurate, yet they take more time to perform. We did not run the entire production run of atomic MD simulations because our modeled systems contain up to 330 000 atoms. To get a more reliable output, we would have had to simulate from 100 to 500 nanoseconds, which would demand lots of computational resources.

Scripts that can be found in the ‘atomic’ folder of the repository:

• copy_dir.pl - a script to copy the directory that contains the atomic MD running bundle.
• generate_system_gro.pl - a script to get the output to check how many atoms are in the simulated system.
• protein_md.pl - a script that runs required commands for GROMACS MD.
• clean.pl - a script that cleans the directory from generated files after running MD simulations.

#### Coarse-grained

On the contrary, CG simulations introduce abstractions to the simulated system. This type of MD uses the concept of ‘pseudo-atoms’ that represent groups of atoms. In this way, the representation of the system is reduced, and the system can be simulated using less computational power. This approach was reached by including MARTINI [9] - general-purpose coarse-grained force field - script into our MD simulation flow.

Scripts that can be found in the ‘CG’ folder of the repository:

• protein_CG_md.pl - a script that runs required commands for GROMACS coarse-grained MD.

#### Usage

More information about prerequisites and usage is provided in the GitHub repository of the tool. In general, if GROMACS (and DSSP for CG version) is successfully installed, then the following commands should run the MD simulations atomic and CG respectively:

perl protein_md.pl [path_to_protein_file]
perl protein_CG_md.pl [path_to_protein_file]

### RamachanDrawX

#### Motivation

Protein structural scientists use Ramachandran plots to gain better insight into the secondary structure of peptides. We used these plots to analyze the secondary structures of the modeled fusion proteins before and after molecular dynamics simulations (more information about fusion protein modeling can be found here ). Ramachandran plots represent the combinations of φ and ᴪ dihedral angles. Most pairs of φ and ᴪ are sterically forbidden since they introduce interatomic clashes . To determine regions and the varieties of dihedral angles in our system, we used RamachanDraw library. We thought that it would be beneficial to add the functionality of customization, therefore we contacted the author of RamachanDraw library, and with his consent, we wrote a small software tool RamachanDrawX that allows plotting Ramachandran plots more efficiently with a possibility to customize them for the desired design.

#### Description

RamachanDrawX is an easy-to-use command-line program based on a Python library to make Ramachandran plots. It contains the functionality of customization that consists of:

• choosing a colormap from existing ones in Matplotlib
• creating a colormap by providing hex color codes
• selecting colors for elements in the plot: dots, numbers, and axles
• changing the default font
• setting background transparency
• changing the opacity of the plot
• setting the resolution of the plot
• naming the output file

#### Usage

Prerequisites of the program are Python3 and its packages: biopython, mathplotlib, and rich. Each of the mentioned packages can be easily installed using conda Python package manager. It is recommended to create and activate a conda environment with mentioned packages installed. More detailed information about this step of setting your work environment can be found in the repository of the tool.

The following line can be called to get a plot with default parameters:

python3 RamachanDrawX.py [protein.pdb]

To check default parameters and get a manual for the usage of the program call:

python3 RamachanDrawX.py -h

Optional arguments that can be used for plot customization:

• --help - show this help message and exit
• --color_map CMAP - provide the color palette to create the colormap or use the existing colormap.
• --element_color EL_COLOR - set the color of dots, elements of axles, and labels (default: '#002733')
• --font FONT - path to the font directory of your choice (default: ./Fonts/Quicksand)
• --transparency TRANSPARENCY - an option whether to set transparent background or not (integer value 0 or 1; default: 1 (True) stands for transparent background setting)
• --alpha ALPHA - the opacity of the colormap (float value [0; 1], default: 1)
• --dpi DPI - the resolution (number of dots per inch, default: 1000)
• --output OUTPUT - the name of the output plot file (default: ./PNG/R_plot.png)
Fig. 7. The default Ramachandran plot generated by RamachanDrawX for 4-coumarate:CoA ligase (4CL)

### FusionPyMOL

#### Motivation

PyMOL [10] is an open-source molecular visualization program that is widely used in the field of structural biology. This year our team decided to include structural bioinformatics into our project and dive into the modeling of fusion proteins. This activity involved quite a lot of manual work to visualize the modeled structures. Therefore we decided to include the power of scripting to automate the process. Eventually, we came up with a library of modules with functionalities of coloring the complex system and calculating the distance between active sites.

#### Description

From a broad perspective, the repository contains a library of Python modules that can be embedded in new workflows related to fusion protein visualization.

Modules that can be found in the repository:

• parse_parameters.py - a module that contains a parameter parsing function. This function takes parameters from a parameter file that determines the way the program works.
• color_fusion.py - a module that includes subroutines to color the fusion protein system: the whole construct, the linker, and the active sites of each linked protein.
• calculate_distance.py - a module that contains functions to calculate distance between active centers when a structure with single or multiple states is provided.
• center_of_mass.py - a module that is used to determine the spots between which the distance is calculated.

#### Usage

Since a handful of arguments determine the program’s functionality, the first step is to define the parameters file. This file contains the attributes such as the path to fusion protein .pdb, the path to protein templates, the length of each protein, the linker, the color palette, and the path to the output file for distance statistics. Parameters allow customization of the workflow, and all of them are described in the README.md file in detail.

The second step is to choose the needed functionality from the provided suggestions. In order to present the library in the simplest way possible, there is a collection of example scripts submitted in the repository. They have the role of coloring the system and calculating the distance between the specified active sites of the linked proteins.

The visualization of the protein system is done by using:

pymol visualise.py -- param.txt

The calculation of distance between active sites is performed by using:

pymol calculate.py -- param.txt
Fig. 8. The example output of visualise.py script for 4CL:GSG:CHS fusion protein system

Regarding the information in the parameters file, the visualise.py script colors the system, loads templates for each linked protein, and aligns them to the modeled structure of fusion protein.

The calculate.py script can calculate the distance between active sites of the linked proteins. The fusion protein structure can contain one or multiple models (states). For example, the latter case occurs when the trajectory file from molecular dynamics simulations is converted to .pdb format. The script recognizes when the input .pdb file contains more than one state of the structure, calculates the distance between active sites for each state, and saves it to a text file.

Fig. 9. The example output of calculate.py script for 4CL:GSG:CHS fusion protein system (multiple states)

Apart from the example scripts to run the modules, the repository contains ‘example’ folder with the required files needed to demonstrate the full functionality of the library.

Additionally, there is a possibility to extend the usage of the provided library and adjust it to the user-defined scripts. In order to use the library, the user has to import the desired module in their script, which can be passed to PyMOL as an argument. Importing modules from the ‘functions’ folder can simply be done with a code fragment:

from functions import parse_parameters, color_fusion, calculate_distance
class Parameters:
pass
par = Parameters()
par.obj_name = ‘fusion’
parse_parameters.parse_parameters(sys.argv[1], par)

The main.py script in the repository is a script that contains the mentioned code excerpt, thus it can be used as the basement for the usage of other library modules.

pymol main.py -- param.txt

## References

1.
Shamriz, S., & Ofoghi, H. (2016). Design, structure prediction and molecular dynamics simulation of a fusion construct containing malaria pre-erythrocytic vaccine candidate, Pf CelTOS, and human interleukin 2 as adjuvant. BMC bioinformatics, 17(1), 1-15. To the article
2.
Yu, K., Liu, C., Kim, B. G., & Lee, D. Y. (2015). Synthetic fusion protein design and applications. Biotechnology advances, 33(1), 155-164. To the article
3.
Zimmermann, L., Stephens, A., Nam, S. Z., Rau, D., Kübler, J., Lozajic, M., ... & Alva, V. (2018). A completely reimplemented MPI bioinformatics toolkit with a new HHpred server at its core. Journal of molecular biology, 430(15), 2237-2243. To the article
4.
Pjotr Prins, Hilmar Lapp, Robert Davey, Richard Smith-Unna, Qingpeng Zhang, Peter Cock, Nathan T. Weeks, Morris Swertz, Matt Shirley, Konrad Förstner, Kevin Murray, Karl Broman, Josh Herr, Artem Tarasov, Joep de Ligt, Joachim Baran, Ilya Sytchev, Francisco Pina-Martins, Felipe Leprevost, Eric Talevich, Daniel Standage, Dan MacLean, Dan, Chris Fields, C. Titus Brown, Bruno Vieira, Bruno P. Kinoshita, Botond Sipos & John Prince, . (2014, August 18). MANIFESTO (Version v1.0). Zenodo. To the article
5.
Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., & Baker, D. (2020). Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences, 117(3), 1496-1503. To the article
6.
Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., ... & Baker, D. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science. To the article
7.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 1-11 To the article
8.
Abraham, M. J., Murtola, T., Schulz, R., Páll, S., Smith, J. C., Hess, B., & Lindahl, E. (2015). GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1, 19-25. To the article
9.
Monticelli, L., Kandasamy, S. K., Periole, X., Larson, R. G., Tieleman, D. P., & Marrink, S. J. (2008). The MARTINI coarse-grained force field: extension to proteins. Journal of chemical theory and computation, 4(5), 819-834. To the article
10.
DeLano, W. L. (2002). Pymol: An open-source molecular graphics tool. CCP4 Newsletter on protein crystallography, 40(1), 82-92. To the article
11.
iGEM Team Heidelberg, (2015). M.A.W.S. To the Wiki
12.
iGEM Team McMasterU, (2017), Genetic algorithm for sequence optimization. To the Wiki
13.
iGEM Team Athens, (2019), MEDEA To the Wiki
14.
Krüger, A., Zimbres, F. M., Kronenberger, T., & Wrenger, C. (2018). Molecular modeling applied to nucleic acid-based molecule development. Biomolecules, 8(3), 83. To the article
15.
iGEM Team Heidelberg, (2017), M.A.W.S. 2.0. To the Wiki
16.
Tseng, C.-Y., Ashrafuzzaman, M., Mane, J. Y., Kapty, J., Mercer, J. R., & Tuszynski, J. A. (2011). Entropic Fragment-Based Approach to Aptamer Design. Chemical Biology & Drug Design, 78(1), 1–13. doi:10.1111/j.1747-0285.2011.01125.x To the article
17.
Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic acids research, 31(13), 3406-3415. To the article
18.
Antczak, M., Popenda, M., Zok, T., Sarzynska, J., Ratajczak, T., Tomczyk, K., ... & Szachniuk, M. (2016). New functionality of RNAComposer: application to shape the axis of miR160 precursor structure. Acta Biochimica Polonica, 63(4), 737-744. To the article
19.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589. To the article
20.
Yan, Y., Tao, H., He, J., & Huang, S. Y. (2020). The HDOCK server for integrated protein–protein docking. Nature protocols, 15(5), 1829-1852. To the article
21.
Navien, T. N., Thevendran, R., Hamdani, H. Y., Tang, T. H., & Citartan, M. (2020). In silico molecular docking in DNA aptamer development. Biochimie. To the article
22.
Jeddi, I., & Saiz, L. (2017). Three-dimensional modeling of single stranded DNA hairpins for aptamer-based biosensors. Scientific reports, 7(1), 1-13. To the article
23.
Heiat, M., Najafi, A., Ranjbar, R., Latifi, A. M., & Rasaee, M. J. (2016). Computational approach to analyze isolated ssDNA aptamers against angiotensin II. Journal of biotechnology, 230, 34-39. To the article
24.
Wang, L., & Brown, S. J. (2006). BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic acids research, 34(suppl_2), W243-W248. To the article