As part of our main objective of developing a useful toolbox for engineering K. phaffii, the PHEAST team put part of its focus on model development. It is usually difficult to find literature research and resources when first steps are done engineering a new strain. Instead, one must approach and learn about the system through theoretical models that will be iteratively improved once experimental data is obtained.

PHEAST modeling serves as a user-friendly software toolbox to perform experiments computationally with K. phaffii, enabling a deeper insight into its viability and performance as a cell factory under many different conditions.

In that sense, when engineering a new cell factory, there is always a very clear point to consider: will the modifications we plan to introduce to our organism affect its performance in some manner? Will it perform effectively enough to become a proper cell factory?, etc. All these questions point to the same thing: the metabolism of our cell factory.

For that matter, we developed two models for the analysis of our cell factory in two different levels of detail:

The first one is based on Genome-Scale Metabolic Models, or GSMs, where one can simulate the introduction and deletion of phenotypes, and by including different metabolic knowledge, estimate the different metabolic fluxes inside the cell.

The second one, called Promoter kinetics model, although mainly focused on our proof of concept of making K. phaffii an efficient cell-factory, it models the naturally found AOX system in K. phaffii . This is a more mechanistic model compared to the GSM as the model tries to reflect the time-dependent kinetics and enzyme production.Therefore, this is also a very useful tool for protein production prediction using AOX over time.

As an extra work, due to our requirement to attach signal peptides to our pMMO subunits as well as a Venus tag to PmoB, we found it was necessary to ensure these were not altering our protein structure and thus, its function. The clear way to go here was AlphaFold, the innovative Deep Learning model for protein structure prediction. This was developed in collaboration with iGEM Groningen team. Visit the Collaborations pageto see more detail on how they helped us!

Genome-Scale Metabolic Model

What are GSMs?

Making quantitative predictions of the whole metabolism flux has enormous effects such as the ability to predict biomass production or a particular protein yield, as well as optimize these processes [1]-[2].

In short, GSMs are genome-wide constraint-based models able to calculate metabolic fluxes using the Flux Balance Analysis (FBA) method [3]. The COBRA (COnstraint-Based Reconstruction and Analysis) toolbox is the most well-known and widely used for working with these models [4]. And in particular we worked with the python implmentation COBRApy.

The idea behind these models is to constrain the solution space for system dynamics by applying our knowledge such as bounds to an enzyme reaction rate, nutrient uptake, etc.. Once we have applied all our known constraints or developed the needed experiments for that, the solution is optimized for a specific objective function. In Fig. 1 below, this process is shown in a more visual way:

Figure 1: Constraining and optimization workflow followed by FBA (Flux Balance Analysis) for GSM metabolic fluxes [5].

The interesting part of these models is that they are built from the annotated genome sequence and thus the different reactions are related to the genes that express the enzymes catalyzing them, which do regulate the reaction flux. They also provide great flexibility when modifying already existing reactions or including new reactions. This clearly allows for many options for genetic engineering.

Which are our base models?

The GSM model for K. phaffii we used as base model is the “ihGlycopastoris”, published in 2015 [6]. This model is an improved version of the previously developed model for K. phaffii “iLC915” published in 2012, extending it with “both native and humanized N-glycosylation for recombinant protein production, but also an estimation of N-glycosylation of P. pastoris native proteins” [7].

Implementation of our genetic toolbox:

Working from these base models as a starting point, genetic engineering can be started thanks to the great number of methods the COBRApy package contains.

This is where we have created a user-friendly tutorial through many of these functionalities applied to a real genetic engineering project such as PHEAST. And not only that, but we have also included the procedure acting as an example of how one should incorporate model constraints by developing experiments. Allowing a proper initialization to any person interested in the Synthetic Biology community to the vast applications and usage of these models.

Considering the importance of keeping this educational focus for potential future iGEM teams and the Synthetic Biology community in general, we developed this part of the modeling with Jupyter Notebook, providing great detail for the different steps and commands. That is why we invite you to see it for yourself in our github repository!

We also include here a simple example we used to add our methane oxidation reaction by pMMO, a protein built by 3 subunits:

# add metabolites on different compartments in case , e.g. methane on the extracellular space

e_methane = cobra.Metabolite(






# define reactions new reactions with its bounds (obtained from literature or 

# personal experimental work)

methane_oxidation = cobra.Reaction(


            name = 'Methane Oxidation',

            lower_bound = 0, # meaning irreversibility

            upper_bound = 10 # including our knowledge


# add involved metabolites and stoichiometry of the reaction



        pheast.metabolites.e_methane: -1.0,

        pheast.metabolites.m1232: -1.0,

        pheast.metabolites.m1215: 1.0,

        pheast.metabolites.m139: 1.0,

# add gene dependency for pMMO reaction

methane_oxidation.gene_reaction_rule = '( pMMO_A and pMMO_B and pMMO_C )'

# add reactions to model


# or you can also knock-out some genes, e.g. knock-out AOX1 and AOX2 and make the cell 

# accumulate methanol



It is also important to mention that the constraining part requires meticulous research and/or development of ad-hoc experiments to obtain the proper constraint values. This shows a great interaction between modeling and labwork for knowledge enrichment.

Assumptions for model constraining:

Some of these experiments might require a lot of effort and it might also be that research has been done already. Therefore, we used some of the already available data from the literature. All those parameters assumptions are shown below:

Model assumptions

Table 1. Parameter assumptions for determining GSM constraints.

Parameter Range found on literature Value chosen References
Amount of protein in yeast cell 50% 50% [8]
Mass of pMMO 300 kDa 300 kDa [9]
pMMO turnover rate 0.5 - 2.5 molecules/s 1 molecules/s [10]
Protein expression under GAP promoter 4 - 36% 10 % [12]
Half-life pMMO 5 h 5 h [13]
Methanol toxicity for GS115 strain 7 % 7 % Results page. Biolector experiment
pMMO synthesis under GAP promoter 0.0069 mmol/gDW/h 0.0069 mmol/gDW/h Github repository. Process explained on detail in our toolbox.
Methanol uptake constraint 221.84 mmol/gDW/h 221.84 mmol/gDW/h Github repository. Process explained on detail in our toolbox.


After developing all the genetic engineering to properly model our system, different processes can be analyzed to see the behavior of our modified organism.

In that sense, we will show here some of the major results, but we actively invite the reader to check our repository out to enjoy the full experience!

As a first step we used our model to simulate the growth of the K. phaffii mutants on glucose, methanol and methane to predict the outcome of our genetic alterations. Our K. phaffii naturally grows on glucose and methanol, but for methane it requires some genetic modifications. The process followed was by introducing the pMMO in our model and running a simulation of the growth of the K. phaffii mutant on methane.

Figure 2: Code example for the genetic engineering process of pMMO introduction allowing methane uptake on K. phaffii. The reaction r1339 that is optimized refers to the biomass production, equivalent to growth rate, and the resulting units are mmol/gDW/h, where gDW refers to gram dry weight. a) Before genetic engineering K. phaffii is a natural methylotroph that cannot metabolize methane. b) After genetic engineering following the process shown above, K. phaffii is able to uptake methane and in steady-state is predicted to grow at a rate of 0.057 mmol/gDW/h. c) However, when a gene knock-out is performed on aox1 and aox2 as also explained above, the growth of our strain becomes infeasible again, as expected and seen by our experiments. Find it on the Results page.

Growth rates under different conditions as vey easy to predict using GSMs. Below we show the growth prediction when our strain was grown on a medium with just glucose, just methanol and just methane after our engineering.

Model growth rate predictions

Table 2. GSM predicted biomass production of our strain under different conditions.

Carbon source Allowed uptake Optimized biomass production
Glucose 1 mmol/gDW/h 0.0802 mmol/gDW/h
Methanol 1 mmol/gDW/h 0.0114 mmol/gDW/h
Methane 1 mmol/gDW/h 0.0057 mmol/gDW/h

However, this was an interesting point as so far our model did not include any information limiting the methane uptake, which would translate into a high methanol concentration inside the cell, known to be highly toxic even for a methylotroph. Here where we needed the help from our wetlab team investigate methanol toxicity. For that, we developed a methanol kill curve experiment that we ran in a Biolcetor. After the experiment, we found that an initial of 221.84 mmol/gDW/h.

Although this might seem a very high value, and it would probably require some further fitting, one must keep in mind that K. phaffii is a methylotroph, and as such it makes sense that our limiting factor is not found on methanol concentration but rather the efficiency on the pMMO expression system. But this helped us even more to confirm the viability of our cell in a methane-rich environment.

The final aim of PHEAST as a proof of concept is to become a cell factory for any protein of interest. First step in our process to achieve this is to grow K. phaffii on glucose to afterwards introduce it to our fermentor with methane and focus on protein production.

As a final step we also included a procedure for the introduction of heterologous proteins into our strain [6]. With this we included expression of pMMO and leghemoglobin into our methanotrophic strain. As our engineering for pMMO was after the constitutive GAP promoter, we also explain how to introduce constraints to this protein from literature research parameters and some assumptions mentioned in Table 1 to ensure its expression through our simulation. And to fit with the cell factory point of view, we changed the objective function from biomass production to hemoglobin production.

All this can be see in Fig.3 below:

Figure 3:For the prediction of optimal conditions for growth and protein yield to be as high as possible, our toolbox allows to perform simulations where the system is placed on different conditions. a) Optimized biomass production when forcing the system to grow on specific glucose and oxygen concentrations. b) Optimized protein of interest expression when forcing the system to grow on specific methane and oxygen concentrations after modifying the objective function to protein expression. c) Optimized protein production for different percentages of pMMO production for methane concentration in volume units for a temperature of 37° C.

Analyzing the different graphs independently, Fig. 3a showing growth predictions on glucose is telling us that our strain can perfectly grow in almost any glucose-oxygen conditions. As there is no toxicity-related effects, what we get from this is that as expected, the more glucose and oxygen you give to our cell, the better it grows. For reference, our model predicted for a glucose uptake of 10 mmol/gDW/h a biomass production of around 0.8 mmol/gDW/h.

Fig 3b. results required a more in depth analysis and pointed out some specificities of our system. Our main conclusions from this behavior are that our system must follow a correlation between oxygen and methane for the best protein production close to a 2:1 ratio for oxygen:methane, respectively. But then, of course, the higher are those uptakes following the ratio, the better is the protein production following a linear relation.

Our reasoning for the fact that below the 1:1 relation between oxygen and methane growth is infeasible is because we are forcing our model to uptake all the methane by setting a strong constraint in methane uptake. The systems needs at least the same amount of oxygen due to the oxygen requirement for Pmmo to perform the methane oxidation.

The gradient on protein production on the y-axis when increasing oxygen might be explained because when the relation is exactly 1:1, all oxygen is used for the uptake of methane and then all the protein we are producing comes from the amino acids uptaken from the environment. After that, as soon as we have more oxygen available, the cell is able to produce its own DNA, RNA and amino acids.

The upper limit on oxygen is also because we are forcing the cell to take all the oxygen. There must be some other reaction besides pMMO that is helping to get that oxygen inside the cell, otherwise we would just get a x=y straight line and all the rest would be infeasible. Anyway, the idea is that once these reactions reach their limit for oxygen, the model is infeasible.

Predictions here are telling us that for a very high methane uptake of 10 mmol/gDW/h we would get around 0.07 mmol/gDW/h of leghemoglobin production. However, this methane uptake also requires a high oxygen uptake, that might be an issue in a fermentor.

Finally, briefly mention after our assumptions for pMMO activity, Fig.3c shows how the leghemoglobin expression was affected by the expression of pMMO as percentage of active pMMO proteins uptaking methane, and the methane available in the environment in volume. And from that figure we conclude that is not that mich about the amount of pMMO proteins working but rather whether the methane concentration is enough or the metabolism to produce the protein. Therefore, the systems might work faster or slower depending on the amount of pMMO but we would get protein expression if methane concentration is high enough.

Model limitations:

While it is debatable what to optimize for as it represents the biological systems goal, GSMs have been shown to predict growth rate very well when optimized for it. Hence the objective function is usually biomass production, while they still lack some work for protein production [6].

Besides, these models are based on the assumption that we work in a steady-state situation, completely losing the evolution of our system and its protein production, and tipically have no regulatory information. And those were the main reasons that brought us to develop the Promoter kinetics model that you can find on the next tab.

Future steps:

In the future, we would proceed developing some specific experiments to improve our constraints. Especially, we would focus on getting the real value of pMMO expressed on our engineered K. phaffii, as well as the methane uptake developed by it.

For sure, getting some experimental data of our final cell factory will be used to verify our predictions and work on improving our model to fit better this data.


[1] Edwards, J. S., Ibarra, R. U., & Palsson, B. O. (2001). In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data. Nature biotechnology, 19(2), 125-130.
[2] Doerks, T., Copley, R. R., Schultz, J., Ponting, C. P., & Bork, P. (2002). Systematic identification of novel protein domain families associated with nuclear functions. Genome research, 12(1), 47-56.
[3] Orth, J. D., Thiele, I., & Palsson, B. Ø. (2010). What is flux balance analysis?. Nature biotechnology, 28(3), 245-248.
[4] Ebrahim, A., Lerman, J. A., Palsson, B. O., & Hyduke, D. R. (2013). COBRApy: constraints-based reconstruction and analysis for python. BMC systems biology, 7(1), 1-6.
[5] Fang, X., Lloyd, C. J., & Palsson, B. O. (2020). Reconstructing organisms in silico: genome-scale models and their emerging applications. Nature Reviews Microbiology, 18(12), 731-743.
[6] Irani, Z. A., Kerkhoven, E. J., Shojaosadati, S. A., & Nielsen, J. (2016). Genome‐scale metabolic model of Pichia pastoris with native and humanized glycosylation of recombinant proteins. Biotechnology and bioengineering, 113(5), 961-969.
[7] Caspeta, L., Shoaie, S., Agren, R., Nookaew, I., & Nielsen, J. (2012). Genome-scale metabolic reconstructions of Pichia stipitis and Pichia pastoris and in silico evaluation of their potentials. BMC systems biology, 6(1), 1-14.
[8] Ertugay, N., & Hamamci, H. A. L. U. K. (1997). Continuous cultivation of bakers' yeast: change in cell composition at different dilution rates and effect of heat stress on trehalose level. Folia microbiologica, 42(5), 463-467. [9] Ross, M. O., & Rosenzweig, A. C. (2017). A tale of two methane monooxygenases. JBIC Journal of Biological Inorganic Chemistry, 22(2-3), 307-319.
[10] Hakemian, A. S., & Rosenzweig, A. C. (2007). The biochemistry of methane oxidation. Annu. Rev. Biochem., 76, 223-241.
[11] Sirajuddin, S., & Rosenzweig, A. C. (2015). Enzymatic oxidation of methane. Biochemistry, 54(14), 2283-2294.
[12] Várnai, A., Tang, C., Bengtsson, O., Atterton, A., Mathiesen, G., & Eijsink, V. G. (2014). Expression of endoglucanases in Pichia pastoris under control of the GAP promoter. Microbial cell factories, 13(1), 1-10.
[13] Geisberg, J. V., Moqtaderi, Z., Fan, X., Ozsolak, F., & Struhl, K. (2014). Global analysis of mRNA isoform half-lives reveals stabilizing and destabilizing elements in yeast. Cell, 156(4), 812-824.
[14] Christiano, R., Nagaraj, N., Fröhlich, F., & Walther, T. C. (2014). Global proteome turnover analyses of the yeasts S. cerevisiae and S. pombe. Cell reports, 9(5), 1959-1965.

Promoter kinetics model

A Kinetic Promoter Model For Temporal Resolution

In order to obtain time dynamics of protein production as well as another measure of production yield which are crucial for an economic analysis and development of the fermentation process, we used a kinetic promoter model for transcription and translation. In addition, such a model allows us to include parameters about the methanol induction of the AOX promoters and thereby model protein production as a function of the level of methanol.

Thus we constructed a system of ordinary differential equations describing the production of mRNA from the GAP and AOX promoter, respectively.
The basic expressions for the rate of mRNA and Protein production are:

Figure 1: Promoter Kinetics

Fig. 1 describes the general equation for the rate of change of mRNA and Protein which are represented by Y. Beta is the production term which is a Hill function in the mRNA case in our model. Alpha represents the degradation rate of the respective molecule which is multiplied by the current level of mRNA or Protein, respectively.

The full system of equations with additional terms such as ones representing transcription repression can be found in the respective code on our Github repository.

In order to obtain useful and comprehensible units we used conversions which can also be found in the code on Github making use of parameters like the expected amount of yeast cells in a fermentation solution (see Table 1 and References).

We based the parameters for mRNA and protein production and degradation on experimental results for S. serevisiae as this yeast is much more comprehensively studied while promising to be an acceptable approximation of our system at this level of complexity (Table 1). After a literature review about the kinetics of methanol induction we extrapolated a Hill coefficient of 1.264 and Km of 0.187 characterizing the activation of the AOX promoter (Table 1).

Our models predict a protein production of 0.01 - 0.35 gram/gram of yeast dry weight(g/gDW) at methanol levels of 0.01 - 5 % (Fig. 2). These levels are reached at about 15 - 30 hours after induction with methanol (Fig. 2).

Fig. 3 shows how methanol concentration builds up with pMMO concentration. When high enough pMMO levels and therefore methanol is reached, Leghemoglobin starts to be produced. We recognize that the linear accumulation of methanol is not realistic as it is consumed by the cell. Due to lack of useful sources as well as time constraints for own experimental work we could not include this process appropriately.

Figure 2 Protein Production From AOX Promoter Under Different Concentrations of Methanol
Figure 3: Temporal Development of PHEAST Methane Utilization and Protein Production

Model Parameters

Table 1. Kinetic Promoter Model Parameters (* Parameters from S.cerevisiae)

Parameter Value Unit References Comment
Transcription Rate* AOX: 1/12
GAP: 1/18
molecules/second [1]
AOX: Higher rate than average in dataset [1] as AOX is relatively strong
GAP: 2/3 of AOX as it is generally considered weaker (all References)
Translation Rate* Leghemoglobin (160 AAs): 4.6e-2*5
pMMO (922 AAs): 8.1e-3*5
molecules/second [5]
Rate of ~7.5 AA/second taking AAs of protein into account [5]
Conservative guess of 5 ribosomes per mRNA (factor 5) [6]
mRNA Degradation* 6.7e-4 molecules/second [7] Most yeast mRNAs have a half-life of 20-30 minutes [7]. We take 25 - 1/1500 per second.
Protein Degradation* 1/14400 molecules/second [8]
We take a half-life typical of stable proteins - ~ 5 hrs [9] & [10]
Hill Coefficient AOX 1.264 -- [10] We fitted the approximate values [10] published and extrapolated the Hill coefficient with a script found on our github.
K_m TF AOX Content 6 Content 6 [10] We fitted the approximate values [10] published and extrapolated the K_m with a script found on our github.
Yeast Dry Weight* 4.6e-11 gram [11] Measurement based on Photopette® handheld spectrophotometer at OD600 [11]
Protein content Yeast* 50 % [12] Based on dry weight [12]
Dry Yeast Per Liter (Fermentation) 25 gram [13]
[13] and [14] state that P. Pastoris, thus K. phaffi,
can reach beyond 100g cell densities at optimal levels and we take a conservative amount of 25 g.


[1] Pelechano, V., Chávez, S., & Pérez-Ortín, J. E. (2010). A complete set of nascent transcription rates for yeast genes. PloS one, 5(11), e15442.
[2] Zhang, A. L., Luo, J. X., Zhang, T. Y., Pan, Y. W., Tan, Y. H., Fu, C. Y., & Tu, F. Z. (2009). Recent advances on the GAP promoter derived expression system of Pichia pastoris. Molecular biology reports, 36(6), 1611-1619.
[3] Várnai, A., Tang, C., Bengtsson, O., Atterton, A., Mathiesen, G., & Eijsink, V. G. (2014). Expression of endoglucanases in Pichia pastoris under control of the GAP promoter. Microbial cell factories, 13(1), 1-10.
[4] Mellitzer, A., Weis, R., Glieder, A., & Flicker, K. (2012). Expression of lignocellulolytic enzymes in Pichia pastoris. Microbial cell factories, 11(1), 1-11.
[5] Bonven B, Gulløv K. Peptide chain elongation rate and ribosomal activity in Saccharomyces cerevisiae as a function of the growth rate. Mol Gen Genet. 1979 Feb 26 170(2):225-30
[6] Riba, A., Di Nanni, N., Mittal, N., Arhné, E., Schmidt, A., & Zavolan, M. (2019). Protein synthesis rates and ribosome occupancies reveal determinants of translation elongation rates. Proceedings of the national academy of sciences, 116(30), 15023-15032.
[7] Geisberg, Joseph V., et al. "Global analysis of mRNA isoform half-lives reveals stabilizing and destabilizing elements in yeast." Cell 156.4 (2014): 812-824.
[8] Christiano R, Nagaraj N, Fröhlich F, Walther TC. Global proteome turnover analyses of the Yeasts S. cerevisiae and S. pombe. Cell Rep. 2014 Dec 11 9(5):1959-65. doi: 10.1016/j.celrep.2014.10.065. Supplemental Information p.S12 table S4
[9] Belle A, Tanay A, Bitincka L, Shamir R, O'Shea EK. Quantification of protein half-lives in the budding yeast proteome. Proc Natl Acad Sci U S A. 2006 Aug 29 103(35):13004-9 p.13004 right column 4th paragraph
[10] Anggiani, M., Helianti, I., & Abinawanto, A. (2018, October). Optimization of methanol induction for expression of synthetic gene Thermomyces lanuginosus lipase in Pichia pastoris. In AIP Conference Proceedings (Vol. 2023, No. 1, p. 020157). AIP Publishing LLC.
[11] Direct yeast cell count at OD600 - Tip Biosystems. (n.d.). Retrieved October 18, 2021, from
[12] Pacheco, M. T. B., Caballero-Cordoba, G. M., & Sgarbieri, V. C. (1997). Composition and nutritive value of yeast biomass and yeast protein concentrates. Journal of nutritional science and vitaminology, 43(6), 601-612.
[13] Verachtert, H. (1989). In Yeast: Biotechnology and Biocatalysis (p. 410). CRC Press.
[14] Siegel, R. S., & Brierley, R. A. (1989). Methylotrophic yeast Pichia pastoris produced in high‐cell‐density fermentations with high cell yields as vehicle for recombinant protein production. Biotechnology and bioengineering, 34(3), 403-404.

Protein structure prediction with AlphaFold

Why do we need a protein structure prediction?

As a final part of our modeling, we were aware of the great challenge of ensuring that our pMMO protein subunits end up in the right part of the cell, being this the cell membrane. Our approach for facilitating the attachment to the membrane, we added to our different subunits a natural signal peptide from K. phaffii to all subunits, as well as a Venus tag to one of the subunits to be able to visualize that indeed our approach was working.

In order to consider in advance whether the inclusion of these new amino acid sequences would affect our protein structures, we wanted to develop a prediction of potential structure modifications those new amino acid sequences might cause.

Why AlphaFold?

Interestingly, while developing our project, the publication about AlphaFold2 came out for the novel protein structure prediction using AI [1]. Briefly, this software is able to predict the protein structure from its amino acid sequence. And we saw it as a great opportunity to implement this innovative technology to solve our doubts with the protein structure.

Therefore, with the collaboration of the Groningen team who had experience on this technology for their project, we analyzed the changes in our protein structures.

Finally, our modeling team was in charge of analyzing any potential modification on the protein structure for the different pMMO subunits that could be caused due to the introduction of signal peptides and Venus tag to ensure right localization in the cell.


It is well known that the protein structure is crucial for it to perform its function as expected. But in order to ensure the different pMMO subunits ended up in the right localization within the cell we had to introduce a native signal peptide on the N-terminal as well as the Venus tag on PmoB.

Basically, the different protein modifications are:

  • PmoA with signal peptide
  • PmoB with signal peptide and Venus tag
  • PmoC with signal peptide

Therefore, while having a meeting with the Groningen team we learned they were already using AlphaFold for the prediction of proteins for their project. As none of our team members had any experience with the tool, it was at that point we realized it was a great opportunity to include this analysis to our project ensuring a proper protein folding.

Kindly, after sending the different protein sequences to the Groningen team, they ran them on their supercomputer and obtained the different predictions. The CIF (Crystallographic Information Files) format files we obtained included the overlap of the original proteins with its modified versions.

Below you can find some animations we developed on PyMOL (reference pymol) for the different cases:

Fig.1 - AlphaFold protein structure prediction for PmoA vs. PmoA with signal peptide.
Fig.2 - AlphaFold protein structure prediction for PmoB vs. PmoB with signal peptide.
Fig.3 - AlphaFold protein structure prediction for PmoC vs. PmoC with signal peptide.
Fig.4 - AlphaFold protein structure prediction for PmoB vs. PmoB with Venus tag.
Fig.5 - AlphaFold protein structure prediction for PmoB vs. PmoB with signal peptide and Venus tag.

Root-Mean Square Deviations for structure similarity

Besides, PyMOL allows for the obtainment of the RMSD (Root-Mean Square Deviations) of atomic positions between two sequences. RMSD is the most well-known quantitative measure for protein structure similarity by comparing the aligned atomic coordinates.

where it follows the usual structure for Root Mean Square Error and di is the distance between the aligned atoms while averaging over the n pairs of equivalent atoms[2].

Usually, the threshold established for the similarity between two protein structures to be considered a success is when the RMSD is lower than 3 Å [3].

Below we show the RMSD for our different cases considered:

RMSD between AlphaFold predicted protein structures

Table 1. RMSD between the different pMMO subunits with the amino acid sequences added for correct localization in the cell.

Proteins to compare RMSD Units
PmoA vs. PmoA with signal peptide 1.184 Å
PmoB vs. PmoB with signal peptide 1.186 Å
PmoC vs. PmoC with signal peptide 3.400 Å
PmoB vs. PmoB with Venus tag 1.601 Å
PmoB vs. PmoB with signal peptide and Venus tag 1.369 Å

From the values obtained, we can conclude the signal peptide should not produce such a big difference in PmoA and PmoB structures. Probably due to its short length of 14 amino acids, not modifying much the amino acid interactions. However, the RMSD for PmoC when including the signal peptide of 3.400 Å is right at the limit for considering both structures similar. Therefore, in future steps we should consider this as a potential issue and probably consider a new signal peptide that allowed a better protein structure conservation.

For the Venus tag, although this is a bigger structure, the RMSD on PmoB is well below 3 Å and therefore it should not produce any issues to our protein function. Interestingly, the RMSD obtained from AlphaFold predictions is lower in PmoB when both signal peptide and Venus tag are added than just Venus tag. Therefore, we assume when both amino acid sequences are added to PmoB, in some way these two structures compensate and we obtain a structure more similar to natural PmoB.

Although this was not an specifically searched effect by our design, it is a very good information as it is telling that when both new sequences are added, out protein is more likely to behave as expected.


After all this data, we would conclude that, apart from some minor consideration for PmoC, our introduction of signal peptide and Venus tag to the different subunits for pMMO do not have a great effect on their structure and therefore we would probably succeed. Unfortunately, due to the lack of time, we could not really compare these results with experimental data.


[1]Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.
[2] Kufareva, I., & Abagyan, R. (2011). Methods of protein structure comparison. In Homology Modeling (pp. 231-257). Humana Press.
[3] Chothia, C., & Lesk, A. M. (1986). The relation between the divergence of sequence and structure in proteins. The EMBO journal, 5(4), 823-826.