Team:MADRID UCM/Software

Cloning design - 4C_FUELS

Since its very beginning, synthetic biology has been fueled by the development of new algorithms and software specialized in the analysis of biological data. Within our team, we have seen first-hand how some of our project's needs could be tackled with a bioinformatic approach.

On this page you will read about the software tool our team has developed to identify novel neutral sites in any prokaryotic organism.

Neutral Sites and how to identify new ones

Discovering Novel Neutral Sites

If you are planning to modify an organism, you will need a genome editing tool. In prokaryotes, this can be achieved using replicative plasmids or by directly modifying the chromosome. Direct modification of the genome usually generates more stable mutants. In addition, some organisms have no known replicative plasmids, so the only option is to modify their chromosomal DNA.

In prokaryotes, the most common technique to achieve this goal is the integration of foreign DNA by homologous recombination. One or two homology regions identical to the organism's genome are assembled next to the genes to be introduced. A single or double recombination event then inserts the genes of interest into the genome.

Cyanobacteria can be transformed via genomic integration, usually relying on a double recombination event in which the gene of interest is flanked by two homology regions targeting a specific genomic locus. The gene is thus inserted into the DNA between both homology regions. During the recombination event, part of this sequence is disrupted or completely removed from the genome.

Because of this, the insertion locus is usually chosen so that disrupting it does not alter the features of the organism. Neutral sites are the common name for such genomic loci, where a gene of interest can be inserted without altering the organism's features other than through the inserted gene itself.

Discovering new Neutral Sites

Only model organisms have a well-characterized set of known neutral sites. When working with a new biological chassis, there are usually no identified neutral sites, so editing novel organisms requires the identification of new integration sites. To do so, we have blended the biological knowledge of how a prokaryotic genome is organized with the power of informatics for the automated analysis of biological sequences.

But… How can you identify a good site for integration?

There are several possibilities. In our case, we do not want to modify any gene that is important for the behavior of the organism, so the modification of the integration site should not alter any essential or beneficial gene. To identify these DNA regions, we have considered the following hypothesis:

If you look for a place in the genome where an upstream open reading frame (ORF) lies on the forward strand and the next ORF lies on the reverse strand, the sequence between them should contain only the terminators of the two transcriptional units or noncoding DNA. A gene could then be inserted into that sequence, and only the spacer DNA or terminator regions would be affected.


In these regions where two consecutive ORFs converge, the DNA could be noncoding DNA (ncDNA) or terminator regions, but also other elements such as small noncoding RNAs (sncRNA) or ribosomal RNA (rRNA). However, ncDNA is much more prevalent than the other elements. In addition, restricting the length of the selected sequence also limits the likelihood that it has an essential or beneficial role for the organism.

Most of these intergenic regions between convergent genes are therefore potential neutral sites for genomic integration. We have created a Python script that reads a genome base by base, identifies these convergent genes and records the sequences flanked by them. This code is the Neutral Site Finder.
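The convergent-gene search described above can be illustrated with a minimal sketch. This is not the team's actual code: the `ORF` and `Site` types, the 0-based half-open coordinates and the function name `find_convergent_sites` are assumptions made for the example, and ORFs are simply compared pairwise after sorting by start position.

```python
from typing import List, NamedTuple


class ORF(NamedTuple):
    """Hypothetical ORF record (coordinates are 0-based, end-exclusive)."""
    start: int
    end: int
    strand: int  # +1 for the forward strand, -1 for the reverse strand


class Site(NamedTuple):
    """Intergenic interval between two convergent ORFs."""
    start: int
    end: int


def find_convergent_sites(orfs: List[ORF]) -> List[Site]:
    """Return the gaps where a forward ORF is followed by a reverse ORF."""
    ordered = sorted(orfs, key=lambda orf: orf.start)
    sites = []
    for left, right in zip(ordered, ordered[1:]):
        convergent = left.strand == 1 and right.strand == -1
        gap = right.start - left.end
        if convergent and gap > 0:  # overlapping ORFs leave no intergenic gap
            sites.append(Site(left.end, right.start))
    return sites


# Toy annotation: only the first pair of ORFs is convergent.
orfs = [
    ORF(0, 300, +1),
    ORF(450, 900, -1),    # converges with the first ORF
    ORF(1000, 1500, -1),  # same strand as the previous ORF
    ORF(1600, 2000, +1),  # divergent with respect to the previous ORF
]
print(find_convergent_sites(orfs))  # [Site(start=300, end=450)]
```

Comparing only neighbouring ORFs keeps the sketch short. A real annotation with nested or overlapping ORFs would need extra handling.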

There are other strategies for the identification of neutral integration sites, such as analyzing transcriptomic data to find non-transcribed regions, or classifying genes as essential, beneficial or non-essential. We have decided to use the convergent genes approach because it does not require the large amount of biological data needed by those methodologies. This way, our software could potentially be used for almost any prokaryotic organism without any information other than the sequenced genome.

Neutral Site Finder

The code

The Neutral Site Finder is a Python script written using the Biopython and regular expressions libraries. It requires two input files: one with the genome sequence in FASTA format, and a second one with the positions of the open reading frames (ORFs) of the genome. It then generates an output file (Results.txt). For every potential neutral site identified, the output displays its sequence, its position and the flanking sequences to use as homology arms.

The output of the script can also be tuned by modifying some parameters in the code.

  • Minimum length of the potential neutral site. This value can be adjusted: the lower the minimum length, the higher the number of potential sites. However, sites shorter than 40 bp are not recommended, to avoid altering the adjacent ORFs. Sites with a length of around 100-300 bp are usually optimal for the search.

  • Sites to avoid. The code includes a Python dictionary of specific sequences that should be avoided within the neutral sites and within the homology regions used for recombination. This feature is useful for avoiding restriction sites used during cloning, or when creating new parts for a specific standard.

  • Homology regions length. Once a neutral site is identified and meets all the criteria specified in the parameters, the code extracts the sequences of the upstream and downstream homology arms for genomic integration. Depending on the organism, the length of these sequences can range from 50 bp to more than 1 kb. For cyanobacteria, a size of 900 bp for each homology arm has been selected for optimal recombination efficiency.
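As an illustration of how the first two parameters might act as filters, here is a minimal sketch. The names `MIN_SITE_LENGTH`, `FORBIDDEN_SITES`, `contains_forbidden` and `passes_filters` are hypothetical and do not come from the script itself; the four restriction sites are only an example of what such a dictionary could hold. The check is run on both strands, which matters for non-palindromic motifs.

```python
MIN_SITE_LENGTH = 40  # assumed default, following the 40 bp recommendation

# Example dictionary of sequences to avoid (name -> recognition sequence).
FORBIDDEN_SITES = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
}

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G, T or N."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def contains_forbidden(seq, forbidden):
    """Return the names of the forbidden motifs found on either strand."""
    seq = seq.upper()
    hits = []
    for name, motif in forbidden.items():
        motif = motif.upper()
        if motif in seq or reverse_complement(motif) in seq:
            hits.append(name)
    return hits


def passes_filters(site, up_arm, down_arm, min_length, forbidden):
    """Accept a candidate only if it is long enough and motif-free."""
    if len(site) < min_length:
        return False
    # Joining the three pieces also catches motifs that span a junction.
    return not contains_forbidden(up_arm + site + down_arm, forbidden)


clean_site = "ACGT" * 30                              # 120 bp, no motif
bad_site = "ACGT" * 10 + "GAATTC" + "ACGT" * 20       # contains EcoRI
print(passes_filters(clean_site, "GGCC" * 5, "TTAA" * 5,
                     MIN_SITE_LENGTH, FORBIDDEN_SITES))   # True
print(contains_forbidden(bad_site, FORBIDDEN_SITES))      # ['EcoRI']
```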

Once a neutral site is identified, by default the script takes the site sequence and counts bases upstream and downstream until each homology arm reaches the required length. The results are displayed with the neutral site in uppercase and the homology arms in lowercase.
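The uppercase and lowercase convention can be reproduced with a few string slices. The sketch below is only an approximation of the behavior described: the function name `extract_site_with_arms` is hypothetical, coordinates are assumed to be 0-based and end-exclusive, and the genome is treated as linear, so arms are truncated at the sequence ends instead of wrapping around a circular chromosome.

```python
def extract_site_with_arms(genome, site_start, site_end, arm_length):
    """Return lowercase arms around an uppercase neutral site."""
    up_start = max(0, site_start - arm_length)        # clamp at genome start
    down_end = min(len(genome), site_end + arm_length)  # clamp at genome end
    upstream = genome[up_start:site_start].lower()
    site = genome[site_start:site_end].upper()
    downstream = genome[site_end:down_end].lower()
    return upstream + site + downstream


# Toy genome with a 4 bp "site" and 4 bp arms, to keep the output readable.
toy_genome = "AAAACCCCGGGGTTTT"
print(extract_site_with_arms(toy_genome, 6, 10, 4))  # aaccCCGGggtt
```

With real data, `arm_length` would be set to the chosen homology arm size, for example 900 for cyanobacteria.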

Homology regions design

The script generates a list of potential DNA sequences that could be used as integration sites. Each of these sequences comprises the actual intergenic region and the flanking sequences, which are part of the adjacent ORFs.


To design homology arms for integration within the neutral site, 30-50 bp in the middle of the neutral site should not be used, since these bases will be removed during the recombination event. The upstream and downstream homology arms are then defined as the next 800-1000 bp in each direction.
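This design rule can be sketched as follows, assuming a gap centered in the neutral site and arms of equal length on both sides. The function name `design_homology_arms`, its defaults (a 40 bp gap and 900 bp arms) and the 0-based end-exclusive coordinates are choices made for the example, not values taken from the script.

```python
def design_homology_arms(genome, site_start, site_end,
                         gap_length=40, arm_length=900):
    """Return (up_arm, down_arm, (gap_start, gap_end)) for one neutral site.

    The central gap is the stretch replaced by the insert; the arms are
    the bases immediately upstream and downstream of that gap.
    """
    site_length = site_end - site_start
    if gap_length > site_length:
        raise ValueError("gap is longer than the neutral site")
    gap_start = site_start + (site_length - gap_length) // 2
    gap_end = gap_start + gap_length
    if gap_start - arm_length < 0 or gap_end + arm_length > len(genome):
        raise ValueError("arms would run past the ends of the sequence")
    up_arm = genome[gap_start - arm_length:gap_start]
    down_arm = genome[gap_end:gap_end + arm_length]
    return up_arm, down_arm, (gap_start, gap_end)


# Toy genome: 1000 bp of "ORF", a 200 bp neutral site, 1000 bp of "ORF".
toy_genome = "A" * 1000 + "C" * 200 + "G" * 1000
up_arm, down_arm, gap = design_homology_arms(toy_genome, 1000, 1200)
print(gap)                         # (1080, 1120)
print(len(up_arm), len(down_arm))  # 900 900
```

In this toy case each arm holds 80 bp of the neutral site and 820 bp of the neighbouring ORF, which matches the idea that the arms extend into the adjacent genes while only the central bases are replaced.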


Thanks to this design, only the bases within the intergenic region are altered, so the essential or beneficial genes are not affected and detrimental effects on cellular fitness are avoided. It is important to note that the regions used as homology arms remain unchanged in the genome. In any case, we highly recommend a manual revision of the selected neutral integration locus. A quick look at the positions of the identified neutral site in a genome browser will confirm that the region lies between two convergent coding genes, and that the intergenic sequence does not correspond to a non-coding gene or to a gene with an essential function. Likewise, if information about essential or beneficial genes is available for the organism, it should also be used to avoid integration sites close to these relevant genes.

In addition, if the integrated genetic device needs to be fully isolated from the surrounding genetic context, the inserted construct should be flanked by strong bidirectional terminators. This way, any undesired transcription from the adjacent genes is stopped.

Get the code

In our project, we have used this script to identify, characterize and design 4 novel neutral sites for the fast-growing Synechococcus PCC 11801 strain. We carefully selected 10 promising neutral site regions and manually revised them, comparing their adjacent ORFs and the neutral site sequence with an essential genes dataset for the closely related Synechococcus PCC 7942.

However, our script can be used for the identification of neutral sites in almost any prokaryotic genome.

You can access the code and its additional documentation in our GitHub repository.

Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., De Hoon, M.J.L., 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423.

Dempwolff, F., Wischhusen, H.M., Specht, M., Graumann, P.L., 2012. The deletion of bacterial dynamin and flotillin genes results in pleiotrophic effects on cell division, cell growth and in cell shape maintenance. BMC Microbiol. 12, 298.

Elnitski, L., Hardison, R.C., Li, J., Yang, S., Kolbe, D., Eswara, P., O’Connor, M.J., Schwartz, S., Miller, W., Chiaromonte, F., 2003. Distinguishing Regulatory DNA From Neutral Sites. Genome Res. 13, 64–72.

Chaves, J.E., Wilton, R., Gao, Y., Munoz, N.M., Burnet, M.C., Schmitz, Z., Rowan, J., Burdick, L.H., Elmore, J., Guss, A., Close, D., Magnuson, J.K., Burnum-Johnson, K.E., Michener, J.K., 2020. Evaluation of chromosomal insertion loci in the Pseudomonas putida KT2440 genome for predictable biosystems design. Metab. Eng. Commun. 11, e00139.