Team:Sydney Australia/Model

Gene Clusters


23 genes, some similarities, some differences. But how do you split your 23 genes to give yourself the best shot at winning? It's akin to choosing people to go on a mission: how do you split people into squadrons to give your goal the best chance of success? You'd like people who've worked together before to stay together, since they already know each other. You'd like to avoid introducing toxic people into the mix - toxicity in the workplace is so not on. You'd like people with similar work ethics to go together - one low-energy dud is going to bring down the morale of a group of excitable young sparks.

There are some constraints though - you only have limited resources. You are, unfortunately, not living in a utopia. You only have so much space in each team - it'll get too crowded otherwise. Some friendships will have to be cut for the sake of the greater good - the mission. You don't want too many teams either - that's too hard to coordinate.

To do so, you need to choose your teams, but if you went in blindly there are so many combinations you could go with (just ask a mathematician to partition 23 things into an unknown number of groups!), and most of them would doom your mission. This is where statistical machine learning comes in - we need help from maths and computers to best inform our choice.
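To put a rough number on "so many combinations": the count of ways to split n distinct items into any number of non-empty, unordered groups is the Bell number B(n). The sketch below is our own illustration in Python (the function name is made up for this page, and it ignores the size limits we describe later), but it shows why trying every grouping by hand is hopeless.

```python
def bell_number(n: int) -> int:
    """Count the ways to partition n distinct items into non-empty groups."""
    # Build the Bell triangle row by row. Each new row starts with the last
    # entry of the previous row; every later entry is the sum of the entry
    # to its left and the entry above that left neighbour.
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    # The first entry of row n is B(n).
    return row[0]


# Number of possible groupings for 23 genes, before any constraints.
print(bell_number(23))
```

For 23 genes this comes out at roughly 4.4 × 10^16 possible groupings, which is far too many to eyeball.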

K Means Clustering

K-means clustering is the secret weapon we need. It works by finding the most "similar" people to put in teams, given you know how many teams you want. You decide that to ensure greater compatibility, it's important you have people with similar drive together - otherwise the mood will be horrible and the mission will fail. So, you base your clustering methodology on the work ethic of each person in your team.

This was applied to our genes. We can't insert all 23 genes in one go, so we needed to insert them in groups.

On a serious, scientific note, K-means clustering is an iterative, unsupervised machine learning technique which aims to minimise the distance between data points and the mean of each of the chosen number of clusters. The algorithm is iterative in that the process of allocating each data point to its closest cluster is repeated until it 'stabilises' (i.e. there isn't much change in the cluster allocations each time the process is run). Unsupervised machine learning is when data points have no 'label' (i.e. no true output we are trying to predict) but rather we wish to find patterns or clusters within the data. This is perfect for our problem of trying to group together genes when we don't really have a true "score" or variable we are trying to map to.
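The assign-then-update loop described above can be sketched in a few lines. This is a minimal, textbook-style illustration in Python, not our actual analysis (which used R's built-in k-means, and R may run a different variant of the algorithm). The function name, the seed and the expression values are all hypothetical.

```python
import random


def k_means_1d(values, k, max_iter=100, seed=0):
    """Cluster a list of numbers into k groups by repeated assign/update steps."""
    rng = random.Random(seed)
    # Start from k data points picked at random as the initial cluster means.
    centres = rng.sample(values, k)
    labels = [None] * len(values)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centres[c])) for v in values
        ]
        # Stop once the allocations have stabilised.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each mean to the average of its current members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels, centres


# Hypothetical expression levels for six genes (not taken from Wang et al.).
expression = [1.0, 1.2, 0.9, 10.0, 10.5, 9.8]
labels, centres = k_means_1d(expression, k=2)
print(labels, sorted(centres))
```

With well-separated values like these, the low-expression and high-expression genes end up in separate clusters whichever starting points are drawn. Real data are messier, and the result can depend on the initial means.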

We fed transcriptome data from Wang et al. (2020) into R's built-in K-means algorithm, with k set to 5. Using transcriptome data gives us information about expression levels at the RNA level, and allows us to suggest the specific function in natural transformation in E. coli.

The algorithm finds 5 clusters; the remaining 3 of our 8 clusters were chosen separately to accommodate insurmountable individual challenges.

Our first cluster, ComEA-ComF-ComA, was already decided on before the rest of the clustering. This was because the literature (Seitz & Blokesch, 2013) indicated that competence genes are much more important for natural transformation than pilus genes. Thus, these genes were like our dream team, the commandos, the ones you know have to team up, otherwise the mission will likely be over before it begins. They have a proven track record, and fixing them in place reduces the size of the task of arranging the rest.

ComC is 4352 bases long, so no other gene can sensibly share a cluster with it under our 5kb limit. As longer synthetic gBlocks become more readily available, the clustering concerns with comC can be mitigated.

ComP, due to its extremely high transcription rate (6352), had to be treated individually to avoid skewing the algorithm.

Figure 1. Our initial results of clustering genes

However, the world is not perfect, and we know we have some restrictions that aren't captured by our statistical modelling. Some team members just carry too much baggage - they will have to work alone, and others may need to be swapped about so we can make sure our strict WHS limits are met. Our sponsors, Twist, only allow a 5kb limit for each cluster, so we had to take this into account.

A few clear standouts from this line-up are clusters 7 and 8, both of which contain only one gene, as explained above! The following changes were made, with explanations to follow:

pilB was moved as the cluster formed was >5kb. The genes present in this cluster were not naturally adjacent to one another, so the choice of moving pilB and not the other genes was largely due to its size. It was moved to a cluster with a similar expression level.

fimT was moved as the cluster with pilT+pilU+fimT+pilB would be too large. pilT and pilU naturally appear in A. baylyi next to each other and partially overlap, so fimT was chosen as the gene to be moved elsewhere. It was moved to a cluster with the next most similar expression levels.
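The manual adjustments above follow a simple rule: if a cluster exceeds the 5kb limit, pick a gene to move and rehome it in the cluster with the most similar expression level that still has room. Below is a minimal Python sketch of that rule. We made these moves by hand, so the function names, gene names, lengths and expression values here are all hypothetical placeholders, not our real data.

```python
LIMIT_BP = 5000  # per-cluster synthesis limit described in the text


def cluster_length(genes, lengths):
    """Total length in base pairs of the genes in one cluster."""
    return sum(lengths[g] for g in genes)


def mean_expression(genes, expression):
    """Average expression level of the genes in one cluster."""
    return sum(expression[g] for g in genes) / len(genes)


def rehome(gene, source, clusters, lengths, expression, limit=LIMIT_BP):
    """Move `gene` from `source` to the closest-expression cluster with room."""
    candidates = [
        name
        for name, genes in clusters.items()
        if name != source
        and genes
        and cluster_length(genes, lengths) + lengths[gene] <= limit
    ]
    if not candidates:
        return None  # nowhere to go: the gene would need its own cluster
    target = min(
        candidates,
        key=lambda name: abs(
            mean_expression(clusters[name], expression) - expression[gene]
        ),
    )
    clusters[source].remove(gene)
    clusters[target].append(gene)
    return target


# Hypothetical example: cluster X is 5500 bp, so one gene has to move.
lengths = {"geneA": 2000, "geneB": 1800, "geneC": 1700,
           "geneD": 1500, "geneE": 2000, "geneF": 1000}
expression = {"geneA": 50, "geneB": 55, "geneC": 60,
              "geneD": 70, "geneE": 200, "geneF": 210}
clusters = {"X": ["geneA", "geneB", "geneC"],
            "Y": ["geneD"],
            "Z": ["geneE", "geneF"]}
print(rehome("geneC", "X", clusters, lengths, expression))
```

Which gene to move is still a judgement call (we considered gene size and whether genes sit next to each other naturally), so the sketch takes that choice as an input rather than trying to automate it.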

And even after we've chosen our groups, how do we know who to deploy first? And how do we know which groups should enter the fray first? Well, this comes down to two main ideas. Firstly, the order of groups should be determined by the logical order of their roles - if they're involved in the set-up of the mission, they should start work first, and if there's a group that's doing final touches, they should go in towards the end. Secondly, within the groups, positions are determined by their work ethic, their drive to get this job done.

With regard to pilus formation, we went with a bottom-up approach. Clusters were ordered so that the genes responsible for the structures found in the inner membrane of the cell were inserted first. This would provide the foundation for the rest of the gene products, found in the periplasm or the outer membrane, whose genes were inserted afterwards.

The following visualisation showcases the physical locations of the gene products of each gene cluster in the pilus formation:

With our clusters decided on, the next step in the design stage was to figure out how to actually get these clusters into E. coli.

Protein Modelling

Protein structure is fundamental to its function and behavior in biological systems such as Acinetobacter baylyi. A string of pearls is a good visual analogy for a protein: pearls are the building blocks of a pearl necklace, and amino acids are the building blocks of a protein. They are bound to each other and form a chain. Proteins don't typically exist as shapeless strings of amino acids, however; they have distinct features such as alpha-helices and beta-strands, as well as domains and subunits that form even more complex protein complexes. Researchers, including biochemists and microbiologists, have long been interested in understanding how a predicted protein structure could affect biochemical pathways or how the protein carries out its biological activity. In A. baylyi, the type IV pilus protein assembly carries out bacterial transformation, a form of horizontal gene transfer. We have therefore constructed the table below of protein models for the Type IV pilus system in A. baylyi.

Our protein modelling consisted of sequential steps that built a clear picture for visualisation and reference purposes. In the first stage, we started from the "pearls": the translated amino acid sequence, which is the protein primary structure, of each shortlisted gene of interest in all 8 gene clusters. Much like translating English into a different language, where you need a dictionary and the correct grammar, this is done by translating the codons into their respective one-letter amino acid residue codes. We then used the automated full-length 3D protein structure prediction tool Phyre2 (Kelley, Mezulis, Yates, Wass & Sternberg, 2015) to generate 23 pil/com models. This process is fairly easy to execute: one only needs to copy and paste the translated amino acid sequences into a server, which runs the complex automated process via a machine-learning algorithm and checks the sequence alignments accordingly. Then, each predicted model was validated using quality metrics such as confidence and coverage, together with a reference PDB entry in the RCSB database (Berman, 2000).
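The "dictionary" step, turning codons into one-letter amino acid codes, can be sketched as follows. This is an illustrative Python example using the standard genetic code, not the tool we actually used. The function name and the short input sequence are made up, and the sketch assumes the sequence is already in frame and contains only A, C, G and T (or U).

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position;
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame coding sequence into one-letter amino acid codes."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    # Read three bases at a time; any incomplete trailing codon is ignored.
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[start:start + 3]]
        if residue == "*":  # stop codon: translation ends here
            break
        protein.append(residue)
    return "".join(protein)


# A made-up 12-base sequence: Met, Ala, Trp, then a stop codon.
print(translate("ATGGCCTGGTAA"))  # prints MAW
```

One caveat: bacteria sometimes start genes with GTG or TTG, which the cell reads as methionine at the start position. This sketch would translate those literally as V or L, so the first residue may need checking by hand.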

| ACIAD# | Protein name | Predicted function | Confidence | Coverage | PDB | PDB entry title |
|---|---|---|---|---|---|---|
| ACIAD0360 | PilD | Prepilin peptidase | 98.9% | 52% | 3S0X | The crystal structure of GxGD membrane protease FlaK |
| ACIAD0361 | PilB | Extension motor, ATPase | 100% | 77% | 4PHT | ATPase GspE in complex with the cytoplasmic domain of GspL from the Vibrio vulnificus type II Secretion system |
| ACIAD0362 | PilC | Inner membrane platform protein | 99.9% | 27% | 3C1Q | The three-dimensional structure of the cytoplasmic domains of EpsF from the Type 2 Secretion System of Vibrio cholerae |
| ACIAD0558 | PilF | Type IV pilus biogenesis/stability protein, motility | 100% | 82% | 2VQ2 | Crystal structure of PilW, widely conserved type IV pilus biogenesis factor |
| ACIAD0695 | FimT | Minor pilin, cell adhesion | 99.7% | 75% | 4IPU | Crystal structure of Pseudomonas aeruginosa (strain: PAO1) type IV minor pilin FimU in space group P21 |
| ACIAD0911 | PilU | Retraction motor, transport protein | 100% | 91% | 5ZFQ | Crystal structure of PilT-4, a retraction ATPase motor of Type IV pilus, from Geobacter sulfurreducens |
| ACIAD0912 | PilT | Retraction motor, transport protein | 100% | 100% | 5ZFQ | Crystal structure of PilT-4, a retraction ATPase motor of Type IV pilus, from Geobacter sulfurreducens |
| ACIAD2639 | ComA | Transmembrane ATPase | 100% | 31% | 2BIB | Crystal structure of the complete modular teichoic acid phosphorylcholine esterase Pce (CbpE) from Streptococcus pneumoniae |
| ACIAD3064 | ComEA | DNA-binding protein | 99.7% | 46% | 2EDU | Solution structure of RSGI RUH-070, a C-terminal domain of kinesin-like protein KIF22 from human cDNA |
| ACIAD3236 | ComF | Helicase/Transferase | 99.8% | 56% | 1VDM | Crystal structure of purine phosphoribosyltransferase from Pyrococcus horikoshii Ot3 |
| ACIAD3314 | PilE | Minor pilin, protein transport | 99.9% | 93% | 1T92 | Crystal structure of N-terminal truncated pseudopilin PulG |
| ACIAD3315 | ComE | Minor pilin, cell adhesion | 99.7% | 82% | 3SOK | Dichelobacter nodosus pilin FimA |
| ACIAD3316 | ComC | Competence, cell adhesion | 100% | 29% | 3HX6 | Crystal structure of Pseudomonas aeruginosa PilY1 C-terminal domain |
| ACIAD3317 | PilX | Minor pilin | 62.6% | 49% | 5WDA | Structure of the PulG pseudopilus |
| ACIAD3318 | ComB | Minor pilin | 98.9% | 42% | 6XXE | CryoEM structure of the type IV pilin PilA5 from Thermus thermophilus |
| ACIAD3319 | PilV | Minor pilin | 99.1% | 79% | 6WXU | CryoEM structure of mouse DUOX1-DUOXA1 complex in the dimer-of-dimer state |
| ACIAD3321 | FimU | Minor pilin | 99.7% | 95% | 7E7O | Cryo-EM structure of human ABCA4 in NRPE-bound state |
| ACIAD3338 | ComP | Major pilin | 99.9% | 95% | 2PIL | Crystallographic Structure of Phosphorylated Pilin from Neisseria: Phosphoserine Sites Modify Type IV Pilus Surface Chemistry |
| ACIAD3355 | PilQ/ComQ | Outer membrane secretin | 100% | 59% | 6HCG | Klebsiella pneumoniae type II secretion system outer membrane complex. PulD, PulS and PulC HR domain. |
| ACIAD3356 | PilP/ComL | Alignment complex protein | 100% | 57% | 2LC4 | Solution Structure of PilP from Pseudomonas aeruginosa |
| ACIAD3357 | PilO/ComO | Alignment complex protein | 100% | 55% | 2RJZ | Crystal structure of the type 4 fimbrial biogenesis protein PilO from Pseudomonas aeruginosa |
| ACIAD3359 | PilN/ComN | Alignment complex protein | 97.7% | 38% | 4BHQ | Structure of the periplasmic domain of the PilN type IV pilus biogenesis protein from Thermus thermophilus |
| ACIAD3360 | PilM/ComM | Alignment complex protein, peptide binding protein | 100% | 99% | 5EOX | Pseudomonas aeruginosa PilM bound to ADP |
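One simple way to use the confidence and coverage columns is to flag the models that deserve a second look before anyone relies on them. The Python sketch below uses the values from the table above. The two cut-offs (90% confidence, 50% coverage) are arbitrary thresholds we picked for illustration, not Phyre2 recommendations, and the PilV confidence is assumed to be a percentage like the others.

```python
# (confidence %, coverage %) per locus tag, copied from the table above.
MODELS = {
    "ACIAD0360": (98.9, 52), "ACIAD0361": (100, 77), "ACIAD0362": (99.9, 27),
    "ACIAD0558": (100, 82), "ACIAD0695": (99.7, 75), "ACIAD0911": (100, 91),
    "ACIAD0912": (100, 100), "ACIAD2639": (100, 31), "ACIAD3064": (99.7, 46),
    "ACIAD3236": (99.8, 56), "ACIAD3314": (99.9, 93), "ACIAD3315": (99.7, 82),
    "ACIAD3316": (100, 29), "ACIAD3317": (62.6, 49), "ACIAD3318": (98.9, 42),
    "ACIAD3319": (99.1, 79),  # assumed to be a percentage
    "ACIAD3321": (99.7, 95), "ACIAD3338": (99.9, 95), "ACIAD3355": (100, 59),
    "ACIAD3356": (100, 57), "ACIAD3357": (100, 55), "ACIAD3359": (97.7, 38),
    "ACIAD3360": (100, 99),
}

# Hypothetical cut-offs, chosen only for this illustration.
MIN_CONFIDENCE = 90.0
MIN_COVERAGE = 50.0


def flag_models(models, min_confidence=MIN_CONFIDENCE, min_coverage=MIN_COVERAGE):
    """Return locus tags whose model falls below either cut-off, in table order."""
    return [
        locus
        for locus, (confidence, coverage) in models.items()
        if confidence < min_confidence or coverage < min_coverage
    ]


flagged = flag_models(MODELS)
print(len(flagged), "of", len(MODELS), "models flagged:", flagged)
```

With these cut-offs, 7 of the 23 models are flagged, almost all for low coverage, meaning only part of the protein was modelled. A high confidence score with low coverage tells us the matched region is probably a true homologue, but says little about the rest of the protein.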

In the proposed second stage, we plan to further validate the predicted protein-complex structures via NMR or mass spectrometry in a proteomics study, after protein extraction and purification. In the third stage, we will detect molecular hotspots of protein-protein and protein-DNA interactions using machine learning classifiers trained with site-directed mutagenesis data. The rationale behind the three stages of our computational method is as follows. A predictive protein assembly model can be built if we combine molecular dynamics-derived features with experimental features. We would then be able to assemble all 23 pil/com proteins into their complexes and build a working hypothesis for the Type IV pilus system. This approach does require extensive molecular dynamics simulation, with its computational cost, as well as experimentally acquired site-directed mutagenesis data. However, by combining experimental features rather than depending only on the computed model results, we may capture more information than free energy terms alone. Sadly, with limited time and the impact of this year's Delta-outbreak COVID-19 lockdown in Sydney, we were unable to carry out this experimental work and finish the proposed protein modelling.

We envision that these proposed stages of protein modelling will be carried out in Phase 2 by next year's iGEM team, to better characterise the interrelationships of these pilus and competence protein complexes and to fill research gaps in both structural biology and synthetic biology.


  1. Kelley, L., Mezulis, S., Yates, C., Wass, M., & Sternberg, M. (2015). The Phyre2 web portal for protein modeling, prediction and analysis. Nature Protocols, 10(6), 845-858. doi: 10.1038/nprot.2015.053
  2. Berman, H. (2000). The Protein Data Bank. Nucleic Acids Research, 28(1), 235-242. doi: 10.1093/nar/28.1.235
  3. Seitz, P., & Blokesch, M. (2013). DNA-uptake machinery of naturally competent Vibrio cholerae. Proceedings of the National Academy of Sciences, 110(44), 17987-17992.
  4. Wang, Y., Lu, J., Engelstadter, J., Zhang, S., Ding, P., & Mao, L. et al. (2020). Non-antibiotic pharmaceuticals enhance the transmission of exogenous antibiotic resistance genes through bacterial transformation. The ISME Journal, 14(8), 2179-2196. doi: 10.1038/s41396-020-0679-2