Team:EPFL/Model

Modelling

Modelling of the flow-through reactor

When designing water flow systems, modelling is key to test different scenarios and geometries through simulation. This is due to the fact that testing physical parts is costly, both in time and money, and because results are difficult to visualise when dealing with water as it is transparent. Thus, flow simulations are immensely helpful as it is possible to see flow lines and pressure inside the system.

We used modelling extensively when developing a continuous flow bioreactor. The first idea that came to mind was a simple straight tube containing yeast with a filter at the exit. Water would flow through the yeast and the copper would gradually be filtered out. We identified the following issues with this design. Firstly, the yeast would likely clog the filter and stop the flow. Secondly, because copper is removed gradually as the water advances through the system, a gradient in copper concentration forms along the tube. As a result, the yeast cells closest to the entrance become saturated with copper sooner than those towards the end. In turn, this would force the user to either stir the yeast, replace only part of the cells, or replace all of them and waste the cells that were still operational.

We thus wanted to create a system in which we avoid these problems. We imagined a circular system in which yeast cells would be trapped in an eternal cycle. Untreated water enters the system and completes a revolution. Then at the exit, the water is split and part of it may exit while the other carries the yeast back towards the beginning of the system, where it merges with new untreated water. We thus ensure that each water molecule completes one revolution at minimum, while some molecules will go through the system more than once as they accompany the yeast from the exit back to the beginning. We wanted the system to function using the water flow at the entrance without adding any pumps.

Figure 1Diagram of a flow-through reactor.

We quickly realized that the circled area in figure 1 required particular design attention, as we were merging two flows without being able to force pressure with a pump as one would commonly do. We thus began to design the connecting part using flow simulations.

Simulation configuration

We used SOLIDWORKS for our flow modelling. The simulations are based on a cavity without friction at the boundaries. The liquid used was water and the environmental pressure was 1 atmosphere (101325 Pa). Flow velocity at the input was constant and uniform over the cross section. Finally, the model accounts for both laminar and turbulent flow.

How the SOLIDWORKS Model Works

The type of simulation software we used is called CFD (Computational Fluid Dynamics). Some modern programs like SOLIDWORKS allow the user to design parts using CAD (Computer-Aided Design) and then to simulate water flow within those parts using CFD.

As you may imagine, software that can simulate water flow in any arbitrary 3D structure requires a high level of sophistication. To understand the mechanisms behind the SOLIDWORKS software, we will explain how boundaries are defined and treated, we will look into the main fluid dynamics equations applicable in the regime we were interested in, and we will describe the numerical methods that the program uses.

Meshing

All CFD models aim to numerically represent real world physics. One problem is that the real world is continuous in time and in space, whereas a computer must break space and time up into small pieces to be able to apply the physical models we know. The program must thus break space up into little blocks called cells, which together form the mesh.

Figure 2Cartesian mesh.1

SOLIDWORKS uses Cartesian meshing, which is said to be simple, fast and robust [2]. The resulting cells are identical cubes with their sides parallel to the spatial x, y and z axes [1]. The cells are then classified as being totally filled with fluid, totally filled with solid, or containing a boundary between solid and fluid.
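As a rough illustration of this three-way classification, the sketch below overlays a uniform Cartesian grid on a circular pipe cross section and labels each cell by testing its four corners. The function name, the grid size and the corner test are assumptions made for illustration; they are not taken from SOLIDWORKS, whose mesher is considerably more sophisticated.

```python
import math

def classify_cells(radius, n):
    """Label the cells of an n x n Cartesian grid covering [-radius, radius]^2.

    A cell is 'fluid' if all four corners lie inside the circular pipe,
    'solid' if none do, and 'partial' if the wall crosses the cell.
    (Corner testing is a simplification chosen for this sketch.)
    """
    h = 2.0 * radius / n  # cell edge length
    labels = []
    for j in range(n):
        row = []
        for i in range(n):
            x0, y0 = -radius + i * h, -radius + j * h
            corners = [(x0, y0), (x0 + h, y0), (x0, y0 + h), (x0 + h, y0 + h)]
            inside = sum(math.hypot(x, y) < radius for x, y in corners)
            if inside == 4:
                row.append("fluid")
            elif inside == 0:
                row.append("solid")
            else:
                row.append("partial")
        labels.append(row)
    return labels

if __name__ == "__main__":
    symbols = {"fluid": ".", "partial": "o", "solid": "#"}
    for row in classify_cells(radius=1.0, n=12):
        print("".join(symbols[label] for label in row))
```

Printing the grid shows a disc of fluid cells surrounded by a ring of partial cells, which are the ones that need the special boundary treatment.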

The most difficult and arguably most important cells to define are those containing a boundary. Two algorithms used in the SOLIDWORKS software are particularly useful in our case, as the pieces we designed had relatively complex geometries.

  1. Mesh presentation with resolution of edges within partial cells

    This method allows for better feature representation without adding additional cells.

    Figure 3Mesh presentation with resolution of edges within partial cells.2
  2. Cell Refinement to create sub cells

    This method is applied where boundaries are more complicated and allows for greater precision.

    Figure 4SOLIDWORKS flow simulation mesh after refinement.2

Physical Models

All CFD codes aim to solve the Navier-Stokes equations, a set of equations developed by Navier, Poisson, St-Venant and Stokes [3] that describe the movement of Newtonian fluids [4]. A fluid is considered Newtonian if its viscous stresses are linearly proportional to the local strain rate, that is, to the rate of change of its deformation over time.

Additionally, CFDs also implement turbulence models among other physical models. In most cases, these do not have analytical solutions, thus complex algorithms must numerically approximate the solutions of the equations. Thankfully, physicists have thought of a wide range of sub-categories for these complex equations that make it possible to calculate their approximations.

Simulation of fluid only cells

The most computationally demanding part of the algorithm is the simulation of fluid cells. The Navier-Stokes equations express conservation of mass, momentum and energy. These three equations form a coupled, nonlinear system of partial differential equations that must be solved numerically.

Continuity Equation

Assuming no matter is spontaneously created or destroyed, the principle of conservation of mass in a small area of space is defined as the continuity equation3.

Figure 5Continuity equation.3

If one assumes water to be an incompressible liquid, meaning that its volume is constant and independent of pressure, then its density (rho) will be constant in space and time. Thus the only remaining term is the divergence of the flow speed which must be equal to zero (Shuyu Sun and Tao Zhang (2020)). More intuitively, in the case of a tube full of water, if a certain amount of water enters, then an equal amount of water must exit.
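For reference, the continuity equation and its incompressible limit can be written in the standard textbook form below. This is a sketch in the notation of the surrounding text (ρ for density, u for flow velocity), not a transcription of the figure.

```latex
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho\, \mathbf{u}) = 0
\qquad \xrightarrow{\ \rho \,=\, \text{const}\ } \qquad
\nabla \cdot \mathbf{u} = 0
```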

This equation is formulated as follows in the SOLIDWORKS software:

Figure 6Continuity equation used in SOLIDWORKS.2

Notice, as we will see for all equations formulated in the documentation, that the more complex operators, like the divergence, are written out in terms of their constituent partial derivatives. This is because the numerical scheme approximates each derivative separately.

Momentum equation

Another important principle is that of the conservation of momentum. The equation to be solved numerically for each small spatial continuum is the following:3

Figure 7Momentum equation.3

With the density ρ, the flow velocity u, the Cauchy stress tensor σ and the body acceleration g.
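With the symbols just defined, the Cauchy momentum equation takes the standard form below. It is given here as a reading aid in textbook notation rather than as a copy of the figure.

```latex
\rho \left( \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\, \mathbf{u} \right)
= \nabla \cdot \boldsymbol{\sigma} + \rho\, \mathbf{g}
```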

This equation stems from Newton’s 2nd law of motion [5], thus we can compare the two to aid our intuitive understanding. The term on the left hand side of the equal sign can be understood as the fluid-dynamics equivalent of mass × acceleration, while the right hand side is the sum of forces felt by the liquid.

Figure 8

The Cauchy Stress tensor results from viscosity shear stress (responsible for turbulent flow) and pressure. The momentum equation can thus be rewritten as follows:
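A standard way to write this decomposition is sketched below, assuming p denotes the pressure, I the identity tensor and τ the viscous stress tensor. These three symbols are introduced here for illustration and may differ from those used in the original figure.

```latex
\boldsymbol{\sigma} = -p\, \mathbf{I} + \boldsymbol{\tau}
\quad \Longrightarrow \quad
\rho \left( \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\, \mathbf{u} \right)
= -\nabla p + \nabla \cdot \boldsymbol{\tau} + \rho\, \mathbf{g}
```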

In the SOLIDWORKS software, this equation is adapted as follows:

Figure 9Momentum conservation equation used in SOLIDWORKS.2

Energy Equation

The third principle used in most CFD codes is the conservation of energy. When applied to fluids, the following is true:

Figure 10Principle of energy conservation for fluid elements.

The equation is expressed in SOLIDWORKS as follows:

Figure 11Energy conservation equation used in SOLIDWORKS.

Turbulent Flow

In general, a good measure of whether a fluid flow is laminar, turbulent or semi-turbulent is the Reynolds Number.

Figure 12

Where ρ is the fluid density, u is the flow speed, L is the characteristic linear dimension, μ is the dynamic viscosity and 𝜈 is the kinematic viscosity.

A fluid flow is turbulent if the Reynolds number is greater than 3500 [6]. To take into account the different flow types, SOLIDWORKS uses the Favre-Averaged Navier-Stokes equations and the Lam & Bremhorst modified k-ε turbulence model with damping functions.
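To put this threshold in perspective, the sketch below evaluates the Reynolds number in its usual form, Re = ρuL/μ = uL/𝜈, for water at several inlet speeds. The 3500 threshold is the one quoted above and 8 m/s is the highest speed reached in our simulations; the 10 mm tube diameter and the water properties at roughly 20 °C are assumed values chosen for illustration.

```python
def reynolds_number(rho, u, length, mu):
    """Re = rho * u * L / mu (equivalently u * L / nu with nu = mu / rho)."""
    return rho * u * length / mu

# Assumed properties of water at about 20 degrees C (illustrative values).
RHO_WATER = 998.0        # density, kg/m^3
MU_WATER = 1.0e-3        # dynamic viscosity, Pa*s
TUBE_DIAMETER = 0.010    # characteristic length, m (hypothetical tube)
TURBULENT_ABOVE = 3500   # threshold quoted in the text

if __name__ == "__main__":
    for speed in (0.01, 0.1, 1.0, 8.0):  # inlet speeds in m/s
        re = reynolds_number(RHO_WATER, speed, TUBE_DIAMETER, MU_WATER)
        regime = "turbulent" if re > TURBULENT_ABOVE else "below the turbulent threshold"
        print(f"u = {speed:5.2f} m/s  ->  Re = {re:9.0f}  ({regime})")
```

Under these assumptions, the flow crosses the turbulent threshold somewhere between 0.1 m/s and 1 m/s, which is why a solver able to handle both regimes is useful.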

The Favre-averaging method is a density-weighted averaging method [7]. It allows for a more robust study of turbulent flow.

The k-ε turbulence model is a very common model used in CFD to simulate mean flow characteristics for turbulent flow conditions. The model consists of two transport equations. The first transported variable is the turbulent kinetic energy k. The second transported variable is the rate of dissipation of that kinetic energy, ε. The model used in most CFD codes is the standard k-ε model of Launder and Spalding (1974). SOLIDWORKS, however, uses the modified k-ε turbulence model with damping functions [8], which adds turbulence damping functions.

Boundary layers

The main issue for cartesian meshing is boundary layers: the cells do not follow the boundaries as they would in other meshing strategies. SOLIDWORKS uses the two scale wall function (2SWF). 2SWF categorises boundaries as “thin” or “thick” depending on how they would influence flow.

When a boundary is complex (coarse, ragged, small curvature...) the program applies the “thin boundary treatment”. This means that precision is valued higher than low calculation times, thus few model simplifications are made.

On the contrary, when a boundary is simple (smooth, large curvature) the thick boundary treatment is applied. Here model simplifications are made to lower calculation time.

In intermediate cases, a combination of the two methods is applied and appropriate model simplifications are made.

Figure 13Boundary layer treatment.2

Numerical methods

When solving a numerical problem, the program treats the very complex operators used in the equations as simpler mathematical objects. The equations as formulated by SOLIDWORKS are the following:

Figure 14

Where the first equation is the continuity equation, the second is the momentum conservation equation, and the third is the energy conservation equation. The indices represent a row-column position in the tensors and vectors since linear algebra is used profusely to solve these types of systems numerically.

SOLIDWORKS implements two types of solvers. Only one was used for our simulations, as it is dedicated to incompressible flows and to flows where the Mach number is lower than 3. Our flows are incompressible and have a Mach number far below 1.
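A back-of-the-envelope check of the Mach number is sketched below. The speed of sound in water (about 1480 m/s near 20 °C) is an assumed reference value, and 8 m/s is the highest flow speed reached in our simulations.

```python
SPEED_OF_SOUND_WATER = 1480.0  # m/s, approximate value near 20 degrees C (assumed)

def mach_number(flow_speed, sound_speed=SPEED_OF_SOUND_WATER):
    """Ratio of the flow speed to the speed of sound in the fluid."""
    return flow_speed / sound_speed

if __name__ == "__main__":
    print(f"Ma = {mach_number(8.0):.4f}")
```

Even at the highest simulated speed, the Mach number stays several orders of magnitude below the solver's limit.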

In this first solver, time-implicit approximations of the continuity and convection/diffusion equations are used. Time-implicit approximations are a method in numerical analysis where the algorithm searches for the zeros of a function of both the current and the future values of the solution, whereas time-explicit schemes calculate the future values based only on the current values [9].
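The difference between the two approaches can be sketched on a toy decay problem, du/dt = -k·u. This is a minimal illustration and not the SOLIDWORKS scheme: the function names, the use of Newton's method for the root search and all numerical values are assumptions chosen for the example.

```python
def explicit_euler_step(u, dt, f):
    """Future value computed from the current value only."""
    return u + dt * f(u)

def implicit_euler_step(u, dt, f, dfdu, tol=1e-12, max_iter=50):
    """Future value x found as the zero of g(x) = x - u - dt * f(x).

    g depends on both the current value u and the unknown future value x;
    Newton's method is used here for the root search.
    """
    x = u  # initial guess: the current value
    for _ in range(max_iter):
        g = x - u - dt * f(x)
        dg = 1.0 - dt * dfdu(x)
        step = g / dg
        x -= step
        if abs(step) < tol:
            break
    return x

if __name__ == "__main__":
    k = 50.0                   # decay rate of the toy problem du/dt = -k*u
    f = lambda u: -k * u
    dfdu = lambda u: -k
    dt = 0.1                   # deliberately large time step: k*dt = 5
    u_exp = u_imp = 1.0
    for _ in range(10):
        u_exp = explicit_euler_step(u_exp, dt, f)
        u_imp = implicit_euler_step(u_imp, dt, f, dfdu)
    print("explicit:", u_exp)  # grows in magnitude (unstable)
    print("implicit:", u_imp)  # decays towards zero (stable)
```

With this large time step the explicit update diverges while the implicit one still decays, which illustrates why implicit schemes are attractive despite the extra root-finding work per step.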

These time-implicit approximations are used together with an operator-splitting technique. Operator splitting is a method in which the physical model is split into sub-models, each of which solves a specific part of the problem. Reassembling the partial solutions gives rise to a splitting error. However, without splitting, the more complicated models would take much longer to solve and be very error prone [10]. In the case of SOLIDWORKS flow simulation, pressure and velocity are decoupled.
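The idea of splitting, and the error it introduces, can be illustrated on a toy equation, du/dt = -a·u + b, whose two terms are solved one after the other at each time step. This sketch is only an analogy: it does not reproduce the pressure-velocity decoupling used by SOLIDWORKS, and the equation, function names and values are assumptions chosen for the example.

```python
import math

def exact_solution(u0, a, b, t):
    """Exact solution of du/dt = -a*u + b."""
    return b / a + (u0 - b / a) * math.exp(-a * t)

def split_solution(u0, a, b, t, n_steps):
    """Solve the decay and source sub-problems one after the other each step."""
    dt = t / n_steps
    u = u0
    for _ in range(n_steps):
        u = u * math.exp(-a * dt)  # sub-model 1: du/dt = -a*u, solved exactly
        u = u + b * dt             # sub-model 2: du/dt = b, solved exactly
    return u

if __name__ == "__main__":
    u0, a, b, t = 1.0, 2.0, 3.0, 1.0  # arbitrary illustrative values
    reference = exact_solution(u0, a, b, t)
    for n in (10, 100, 1000):
        error = abs(split_solution(u0, a, b, t, n) - reference)
        print(f"{n:5d} steps: splitting error = {error:.2e}")
```

Each sub-problem is solved exactly, so the remaining discrepancy is purely the splitting error, and it shrinks as the time step is reduced.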

Conclusion for digging into CFD simulation

The software used in CFD is complex. By describing the aspects that were important for our particular use case, we merely skimmed the surface of what the software can do. It is nonetheless very interesting, as it combines advanced physical models with state-of-the-art numerical analysis methods.

The SOLIDWORKS software uses the continuity, momentum and energy equations and applies various models (k-ε, Favre averaging, etc.) to simulate complex fluidic systems. In addition, thanks to its meshing methods and integrated CAD tools, we could design our parts and test them using the same program.

Flow simulations

We designed a large number of systems to try to find one that would function as intended. The following figures illustrate our thought process, but we chose not to show all of the simulations as getting from one part design to the next was a tedious task during which we ran through a plethora of iterations, tweaking the design little by little.

Parallel flow adding

Our first idea was to merge the flow in a parallel manner by making the cross section of the main pipe equal to the sum of the cross sections of the inputs.

We began by testing the design at very low flow speeds without success (Simulation 1).

Figure 15Simulation 1: Parallel flow adding at low flow speed.

We continued testing the system while increasing the water flow speed. We realised the importance of flow speed, as the result was not simply that the existing flow became faster, but that entirely new flow lines appeared and some disappeared. We successfully created a system in which some of the water at the exit rejoins the water at the input (simulation 2). However, flow speeds of up to 8 m/s were far too high, as they would demand a strong pump and we did not want to make the system too expensive.

Figure 16Simulation 2: Parallel flow adding at high throughput.

Perpendicular flow adding

We transitioned to a perpendicular flow adding design. Simulation 3 shows that water coming from the main input (untreated in our case) rejoins the exit directly. This was the opposite of what we desired as it would result in the following in our case of fixing copper with yeast: First, some of the water would not complete a revolution of the system and thus be untreated. Secondly, the yeast would not rejoin the start of the system, thus making it amass at the exit.

Figure 17Simulation 3: Perpendicular flow adding without a special geometry.

By simple observation of Simulation 3, it might seem obvious that the design would not work, so one may wonder why we decided to work on this. Adding the flow at an angle gives us the opportunity to imagine various geometries to produce a pinch off effect. By abruptly pinching the main input channel as in Simulation 4, we create an area in which flow is accelerated. Bernoulli’s principle in fluid dynamics states that a decrease of pressure occurs when a liquid gains in flow speed. By increasing flow speed locally, we hoped to create a pocket of low pressure right where we wanted the water from the main input and some of the water at the exit to merge. This pocket of lower pressure would accelerate the water particles towards it, thus facilitating flow adding. Simulation 4 shows a first attempt at this pinch off geometry. It was unsuccessful, but the flow lines in the wrong direction did decrease.
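The expected size of this effect can be estimated by combining mass conservation with Bernoulli's principle for steady, incompressible, frictionless flow at constant height. The sketch below is an idealised estimate, not a result of our simulations; the function name, the inlet speed and the cross-sectional areas are hypothetical.

```python
RHO_WATER = 998.0  # kg/m^3, assumed density of water near 20 degrees C

def pinch_effect(v_in, area_in, area_pinch, rho=RHO_WATER):
    """Speed in the pinch and pressure change relative to the inlet.

    Continuity:  area_in * v_in = area_pinch * v_pinch
    Bernoulli:   p_in + rho*v_in^2/2 = p_pinch + rho*v_pinch^2/2
    """
    v_pinch = v_in * area_in / area_pinch
    delta_p = 0.5 * rho * (v_in**2 - v_pinch**2)  # p_pinch - p_in, in Pa
    return v_pinch, delta_p

if __name__ == "__main__":
    # Hypothetical numbers: 0.5 m/s inlet, cross section halved by the pinch.
    v_pinch, delta_p = pinch_effect(v_in=0.5, area_in=2.0e-4, area_pinch=1.0e-4)
    print(f"speed in pinch: {v_pinch:.2f} m/s, pressure change: {delta_p:.0f} Pa")
```

A negative pressure change means the pinch sits at lower pressure than the inlet, which is the pocket of low pressure the geometry is meant to create.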

Figure 18Simulation 4: Perpendicular flow adding with a pinch off geometry.

We then tried to pinch off the input channel more abruptly (simulation 5). Again, it was unsuccessful, but we had made progress by further reducing the flow lines that were in the wrong direction.

Figure 19Simulation 5: Perpendicular flow adding with more abrupt channel pinching.

When working on flow systems, we learned that much more needs to be considered than the geometry of the main part alone. One very simple but important part of the system is the main tube. In all of the previous simulations, we made the tube’s cross section smaller than that of the main part. The reasoning followed the same logic as for the pinch off. When a liquid passes from a large cross section to a small one, it gains speed, because the mass flow rate must be conserved. As explained previously, Bernoulli’s principle allows us to assume that the higher speed will result in lower pressure. We expected this to help the water merge when entering the main tube. However, when testing this hypothesis, we realised that we were mistaken. The system depicted in simulation 6 differs from the previous iteration only in the width of the main tube. For the first time, we had successfully created a system in which part of the flow at the exit rejoins the start and merges with the water entering through the input. However, we remained unsatisfied, as the flow lines showed that the water would spin in two parts of the system.

Figure 20Simulation 6: Perpendicular flow adding with channel pinching and wider tubing.

Simulation 7 was our final and most interesting design. We kept in mind the following concepts, which we had learned from the previous simulations:

  1. Flow merging at an angle.
  2. Channel pinching to create a pocket of low pressure.
  3. Wide main tube.

We tweaked the design little by little to achieve a better system. Instead of perpendicular flow merging, we found that an inclined design yielded a better result. Additionally, we smoothed out the pinch off geometry, resulting in a pocket of low pressure (simulation 7). We had successfully created a system in which some of the water close to the exit would rejoin the cycle at its beginning, without the spinning observed previously (simulation 6).

This is the piece that we physically printed, connected with tubing and tested to determine whether the simulations were representative of the real world.

Figure 21Simulation 7: Pressure inside the final part. Pocket of low pressure at water merging point.

Biomodelling

Predicting the structure of our dimers

As described in our Design page, we decided, in order to improve our CUP1 and yeast surface display system, to dimerize our CUP1 protein, introducing linkers between each copy. To aid us in the design of said linkers, we consulted with Karla Castro, from the Laboratory of Protein Design and Engineering at EPFL. Among other things, she advised us to try as many different linkers as we could, given that there was no one rule we could apply to find an optimal linker.

After doing research on the most commonly used linkers, we came up with a first list of 26 dimers we wanted to test. However, creating 26 different primers and performing 26 simultaneous cloning experiments was not a realistic goal. We needed a way to narrow down the number of linkers we wanted to test. To accomplish this, we decided to use a structure prediction software.

Conditions for choosing a dimer

As previously stated, amino acid sequence influences the relevant protein’s three dimensional structure. In particular, several important amino acid characteristics participate in forming regular and easily recognizable protein structures. Thus, when designing our linker sequences, we paid particular attention to the following properties: polarity or hydrophobicity, size and overall flexibility.

These characteristics confer the following features to the three dimensional protein structure: flexibility, size (so distance between each CUP1 molecule) and folding.

Flexibility

This attribute was particularly interesting to look at for a variety of reasons. When designing our linkers, we were curious to see the effect of a flexible and non-flexible linker on the complex. On the one hand, additional flexibility could allow for the complex to be able to move more freely on the yeast surface, in a way reaching for more copper ions. However, too much flexibility in the linker could create a loop whereby the two CUP1 proteins could come into contact with each other. This would be detrimental to the proper functioning of the complex. Furthermore, rigidity between the two proteins could allow for some structure, ensuring a physical forced separation of the two.

Distance

Just as for flexibility, distance has an impact on the proper functioning of the complex. If our linker is too long, we risk damaging the process of transporting our protein complex to the surface with our yeast display system. Indeed, the larger a protein complex is, the more difficult it is to express it on the membrane, especially if this is not an endogenous process, which ours is not. However, if our linker is too short, our CUP1 proteins risk interacting with each other. Should they do so, our entire complex is non-functional.

Folding

Using the prediction simulation, we were able to see whether, at least theoretically, the entire complex would fold correctly. This is a crucial aspect of our protein structure.

Figure 22Polyproline rigid linker complex three dimensional structure, simulated with AlphaFold.Linker sequence: APAPAPAPAPAPAP.

Rigid linkers exhibit relatively stiff structures by adopting α-helical structures or by containing multiple Pro residues. Indeed, Pro is a unique amino acid with a cyclic side chain that causes a very restricted conformation. (doi:10.1016/j.addr.2012.09.039) Furthermore, the lack of amide hydrogen on Pro typically prevents the formation of hydrogen bonds with other amino acids, and therefore reduces the interaction between the linkers and the protein domains. As a result, the inclusion of Pro residues might increase the stiffness and structural independence of the linkers. (doi: 10.1042/bj2970249) Here we can easily see the slightly bent polyproline structure between the two CUP1 copies.

Figure 23First flexible linker three dimensional structure, simulated with AlphaFold.Linker sequence: GGGGSGGGGSGGGGS.

Flexible linkers are generally composed of small, non-polar (e.g. Gly) or polar (e.g. Ser or Thr) amino acids11.

This linker is mainly composed of glycines, but includes 3 serines. Glycine is the smallest and most flexible amino acid, as it lacks a side chain. The glycines are thus able to rotate more freely in space, each in a different direction from its neighbours. Serines are a bit less flexible and thus confer a bit more structure to the linker, hopefully keeping the two moieties from reaching each other. This linker is 15 residues long, which is an average length.

In addition, the incorporation of Serine or Threonine can maintain the stability of the linker in aqueous solutions by forming hydrogen bonds with the water molecules, and therefore reduce the unfavourable interaction between the linker and the protein moieties [12].

Figure 24Second flexible linker complex three dimensional structure, simulated with AlphaFold.Linker sequence: GGGGSGGGGSGGGGSGGGGS

This linker is very similar to the one above, but is 20 residues long. We consider this linker to be relatively long. We were curious to compare the lengths of these two linkers to observe how our complex would behave.

Figure 25First semi-flexible linker complex three dimensional structure, simulated with AlphaFold.Linker sequence: GGGGSEAAAKGGGGS

This linker contains a rigid center, composed of one glutamic acid, three alanines and one lysine.

On the edge, we have added glycines and serines to emulate flexibility. We hoped this would create a rigid center with flexible edges. The advantage of the rigid structure is that it would keep the two protein copies apart.

Therefore, for the semi-flexible (or semi-rigid) conformation, we decided to combine the design choices used for flexible and rigid linkers. The main idea is to separate the two functional domains by an adequate distance, preventing any interaction between them and maintaining their independent functions, while allowing high flexibility in the regions close to the C- and N-termini of the fused proteins. In this way, the complex is rigid in the centre and flexible in the vicinity of the two protein termini, allowing further degrees of freedom and proper folding.

Figure 26Second semi-flexible linker complex three dimensional structure, simulated with AlphaFold.Linker sequence: GGGGSEAAAKEAAAKGGGGS.

This linker is very similar to the one above, but is 20 residues long. We consider this linker to be relatively long. We were curious to compare the lengths of these two linkers.

Figure 27First linker complex three dimensional structure, simulated with AlphaFold.Linker sequence: EAAAKEAAAKEAAAK.

This linker is a rigid linker, mainly consisting of glutamic acids, lysines and alanines. As we can see from the prediction, these residues form a single alpha helix, spanning from one protein copy to the other. This is indeed a rigid structure, as we had expected.

Many natural linkers exhibit α-helical structures [13]. These are characterized by high rigidity and stability, with intra-segment hydrogen bonds and a closely packed backbone. Therefore, stiff α-helical linkers may act as simple rigid spacers between protein domains. The most common rigid linkers are the alpha helix-forming linkers with the sequence (EAAAK)n, which show approximately 80% helicity with n = 3 [14].

Figure 28Second linker complex three dimensional structure, simulated with AlphaFold.Linker sequence: EAAAKEAAAKEAAAKEAAAK.

This linker is very similar to the one above, but is 20 residues long. We consider this linker to be relatively long. We were curious to compare the lengths of these two linkers. As we can see, the two proteins come nowhere near each other.
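The linker sequences listed in the figure captions can be compared side by side with a few lines of Python. The labels in the dictionary are our own shorthand, and the Gly/Ser fraction is only a crude proxy for flexibility that we assume for this illustration; it is not an output of AlphaFold.

```python
# Linker sequences as listed in the figure captions above.
LINKERS = {
    "rigid (Ala-Pro)": "APAPAPAPAPAPAP",
    "flexible 1": "GGGGSGGGGSGGGGS",
    "flexible 2": "GGGGSGGGGSGGGGSGGGGS",
    "semi-flexible 1": "GGGGSEAAAKGGGGS",
    "semi-flexible 2": "GGGGSEAAAKEAAAKGGGGS",
    "rigid helix 1": "EAAAKEAAAKEAAAK",
    "rigid helix 2": "EAAAKEAAAKEAAAKEAAAK",
}

def summarize(sequence):
    """Return (length, fraction of Gly/Ser residues) for a one-letter sequence."""
    flexible = sum(sequence.count(aa) for aa in "GS")
    return len(sequence), flexible / len(sequence)

if __name__ == "__main__":
    for name, seq in LINKERS.items():
        length, gs_fraction = summarize(seq)
        print(f"{name:16s} {seq:22s} length = {length:2d}  Gly/Ser = {gs_fraction:.2f}")
```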

Methods

Prediction software: AlphaFold

For the last 50 years, one of the most important open research problems in biochemistry has been predicting a protein’s three-dimensional structure based solely on its amino acid sequence. This three-dimensional structure results from the way the amino acid chain folds. Thus, if one can predict how the protein’s residues interact with each other and fold into regular structural elements (alpha helices, beta sheets), one can predict the structure of the protein based solely on its amino acid sequence.

While today it is common to predict the structure of a protein when similar structures are previously known, regularly predicting protein structure with atomic accuracy using computational methods was done for the first time by AlphaFold. They validated their neural network-based model in the 14th Critical Assessment of protein structure Prediction (CASP14), where they demonstrated strong accuracy in their model. This software uses a novel machine learning approach, incorporating both biological and physical knowledge about protein structure and comparing multi-sequence alignments.

AlphaFold is a prediction software capable of determining 3D protein structure from an amino acid sequence15. It uses an Artificial Intelligence system developed by DeepMind, an artificial intelligence research laboratory.

So far, computational methods to predict three-dimensional protein structures based on the amino-acid sequence have either focused on the physical interactions or the evolutionary history. The physical interaction programme integrates our current understanding of molecular driving forces into either thermodynamic or kinetic simulation of protein physics or statistical approximations. On the other hand, the evolutionary programme, a development in recent years, derives protein structure from bioinformatics analysis of the evolutionary history of proteins, pairwise evolutionary correlations and homology to already solved structures.

Figure 29Model architecture. Arrows show the information flow among the various components involved in the AlphaFold network. Array shapes are shown in parentheses with s, number of sequences (Nseq in the main text); r, number of residues (Nres in the main text); c, number of channels.15

Neural network

AlphaFold uses novel neural network architectures and training procedures based on geometric, physical as well as evolutionary constraints to ensure accurate structure prediction. Their main features include:

  • self-distillation and self-estimates of accuracy, allowing learning from unlabelled protein sequences
  • iterative refinement of predictions using intermediate losses
  • embedding of multiple-sequence alignments (MSA)
  • a masked MSA loss to jointly train with the structure
  • a new output representation and associated loss that enable accurate end-to-end structure prediction

Using the primary amino acid sequence and aligned sequences of homologues as inputs, AlphaFold predicts the 3D coordinates of a given protein with atomic accuracy.

The architecture of the network includes mainly two stages that allow the network to extract a large amount of different and complementary data making the model extremely accurate.

First the core of the network is built on a series of neural network layers that process the input data in an array Nseq × Nres, where Nseq represents the number of sequences and Nres the number of residues. This part of the architecture is called Evoformer and gives us an output that represents a processed Multiple Sequence Alignment (MSA) and a Nres × Nres array that represents the residue pairs.

To perform its function, the Evoformer blocks contain a number of attention-based and non-attention-based components. In the context of artificial neural networks, attention mechanisms mimic cognitive attention by weighting the most relevant parts of the input more heavily. Stacking these blocks leads to an iterative process that continually refines the structural hypothesis, in this case based on biochemical, physical and evolutionary information.

The combination of this information is the breakthrough in the Evoformer block. This unit block indeed uses new mechanisms to allow a flow of information within the MSA framework and the pair representation, thus enabling a direct correlation between the spatial and evolutionary information.

The Evoformer is followed by the structure module. This unit defines the system of coordinates in the form of global rigid body frames: it introduces a 3D structure representation as a rotation and a translation for each residue of the amino acid sequence.

All these representations are initialized in an original state with all rotations set to the identity and all positions set to the origin, but rapidly refine and develop an accurate protein structure shedding light on atomic details.

The crucial innovations in this section of the network include breaking the chain structure, which allows simultaneous local refinement of all parts of the structure. Furthermore, a novel equivariant transformer allows the network to implicitly reason about the side-chain atoms, which are not represented explicitly. Finally, a new loss term places substantial weight on the orientational correctness of the residues.

With this network, the authors established a new framework not only for analysing and processing structural information, but also for iterative refinement: the configuration is repeatedly updated according to the behaviour of the loss function, so that the successive updates act as a feedback system.

Evoformer

The crucial component of the artificial neural network developed by DeepMind is called the Evoformer. This building block treats structure prediction as a graph inference problem in three-dimensional space, in which the edges of the graph are defined by residues in proximity. Each element of the pair representation encodes information about the relation between two residues.

The MSA is represented as a table in which the rows correspond to the sequences in which the residues appear, while the columns encode the individual residues of the input sequence.

Within this framework, the model defines a number of update operations that are applied in series in each block, so that both representations receive new information at every block.

The MSA representation updates the pair representation through an element-wise outer product that is summed over the MSA sequence dimension. In other words, the relation between two residues is updated by considering how they vary together across the aligned sequences. Because this operation is applied in every block, information flows continuously from the evolving MSA representation to the pair representation.
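As a rough illustration of this operation, the sketch below sums outer products over the sequence dimension. It is a strong simplification and rests on assumptions: each MSA entry is reduced to a single number, the learned projections of the real network are omitted, and the function name `outer_product_sum` and the toy values are hypothetical.

```python
# Simplified sketch: one scalar feature per (sequence, residue) entry of the MSA.
# The real network uses learned projections with several channels.

def outer_product_sum(left, right):
    """Return update[i][j] = sum over sequences s of left[s][i] * right[s][j]."""
    n_seq = len(left)
    n_res = len(left[0])
    return [
        [sum(left[s][i] * right[s][j] for s in range(n_seq)) for j in range(n_res)]
        for i in range(n_res)
    ]


# Toy MSA features: 2 sequences (rows) and 3 residues (columns), values are made up.
msa_features = [
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
pair_update = outer_product_sum(msa_features, msa_features)
```

In this toy example, residues 1 and 2 (zero-based) never carry a non-zero feature in the same sequence, so their pair entry stays at zero, while residues that are active in the same sequences receive a larger update.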

The pair representation itself is updated in two different ways. Both stem from consistency constraints: for a pairwise description of the residues to be representable as a single 3D structure, many constraints must be fulfilled, one of which is the triangle inequality on distances. Geometrically, this means that the updates of the pair representation can be organized in terms of triangles of edges connecting three different nodes (figure 30c). The first update, triangle self-attention, adds an extra bias to axial attention in order to take the missing edge of the triangle into account. The second is a non-attention operation, called the triangle multiplicative update, which updates the missing third edge from the two other edges. The triangle multiplicative update was originally developed as a replacement for the attention algorithm.
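The idea of updating one edge from the two other edges of each triangle can be sketched as follows. This is an assumption-laden simplification, not the published layer: edges carry a single number, the gating, normalization and learned projections are left out, and the function name `triangle_update_outgoing` and the edge values are hypothetical.

```python
# Simplified sketch of a triangle multiplicative update with "outgoing" edges:
# edge (i, j) is updated from edges (i, k) and (j, k), summed over the third node k.

def triangle_update_outgoing(a, b):
    """Return update[i][j] = sum over k of a[i][k] * b[j][k]."""
    n = len(a)
    return [
        [sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Toy pair features for 3 residues (made-up values, zero on the diagonal).
edges = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
edge_update = triangle_update_outgoing(edges, edges)
```

With three residues and a zero diagonal, the update of edge (0, 1) only receives the contribution of the third node, that is the product of edges (0, 2) and (1, 2).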

Within the MSA representation, the algorithm again uses a variant of axial attention. In particular, additional logits are projected from the pair representation to bias the attention of the MSA representation. This closes the cycle of information exchange between the MSA and pair representations and increases the amount of information extracted from the sequences. This enhancement comes from the efficient combination of the two frameworks, which encode different but complementary information on the sequence. Furthermore, this ensures that the Evoformer block is able to extract most of the information encoded in the sequences and in the residues, generating a highly reliable and accurate structure.
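A minimal sketch of attention with a pair bias is given below, for a single sequence of the MSA and a single attention head. It assumes that the bias is simply added to the scaled dot-product logits before the softmax. The function names and all numerical values are hypothetical, and the learned projections are omitted.

```python
import math

# Simplified sketch: attention weights over residues for one MSA row,
# with one bias value per residue pair added to the logits.

def softmax(values):
    """Numerically stable softmax of a list of numbers."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(queries, keys, pair_bias):
    """Return weights[i][j] = softmax over j of (q_i . k_j / sqrt(d) + pair_bias[i][j])."""
    dim = len(queries[0])
    weights = []
    for i, query in enumerate(queries):
        logits = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) + pair_bias[i][j]
            for j, key in enumerate(keys)
        ]
        weights.append(softmax(logits))
    return weights


# Toy example with 3 residues: zero queries and keys give uniform attention.
queries = [[0.0, 0.0] for _ in range(3)]
keys = [[0.0, 0.0] for _ in range(3)]
no_bias = [[0.0] * 3 for _ in range(3)]
# Made-up bias: the pair representation favors the pair (0, 1).
bias = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

uniform_weights = biased_attention_weights(queries, keys, no_bias)
biased_weights = biased_attention_weights(queries, keys, bias)
```

Without bias, residue 0 attends equally to all three residues. With the bias, most of its attention moves to residue 1, which is how information from the pair representation can steer the MSA update.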

Figure 30: Architectural details [15]. a, Evoformer block. Arrows show the information flow; the shape of the arrays is shown in parentheses. b, The pair representation interpreted as directed edges in a graph. c, Triangle multiplicative update and triangle self-attention. The circles represent residues. Entries in the pair representation are illustrated as directed edges and, in each diagram, the edge being updated is the one corresponding to the pair ij.

As previously stated, AlphaFold’s methodology uses a combination of bioinformatics and physical approaches. In particular, the authors use physical and geometric biases to build components that learn from the PDB data with minimal hand-crafted input from the user. The result is a flexible network that learns efficiently from the limited data available in the PDB.

Display software: VMD

Visual Molecular Dynamics, or VMD for short, is a program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting. VMD allows molecules to be drawn as lines, bonds, CPK, licorice, VDW spheres, ribbons, tubes, secondary structure cartoons, points, C-alpha traces, and surfaces.

For the structures below, we used the ResName coloring method and the NewCartoon drawing method. This representation uses a combination of cartoons and ribbons.

The ResName coloring method colors each residue according to its residue name, using the Resname category. We chose to differentiate residues on the final structure because our design began as an amino acid sequence. When designing our dimers, we chose the amino acid sequence of our linkers based on particular properties of the residues: hydrophobicity, flexibility and size. These characteristics are the main ones responsible for creating regular protein structures such as alpha helices or beta sheets. Based on this, we were able to predict the structure of our linkers even before running the simulations, and we then used the simulations to confirm our predictions. Thus, we felt it was logical to highlight residues, as they determine the final structure.

In the cartoon representation, helices in the molecule are drawn as cylinders, beta sheets as solid ribbons, and all other structures (coils and turns) as tubes.

Using these software tools, we successfully predicted the structures of all 26 of our proteins, including their linkers. From these, we selected our seven favorite dimers, combining flexible, semi-rigid and rigid models.

Below are the predicted structures of the seven dimers that we later successfully cloned into our yeast. For each, you can see two CUP1 molecules, as well as the linker that joins them. For more information on the dimer design process, please consult the Design page.

References

  1. GiD Reference Manual
  2. Dassault Systèmes (2014)
    Numerical Basis of CAD-Embedded CFD
  3. Navier–Stokes equations
    Wikipedia
  4. Newtonian fluid
    Wikipedia
  5. Henrie, Carpenter & Nicholas (2016)
    Real-Time Transient Model–Based Leak Detection
    Pipeline Leak Detection Handbook, pp. 57-89
  6. Nuclear Power Com
    Reynolds Number for Turbulent Flow
  7. CFD Online
    Favre averaged Navier-Stokes equations
  8. Lam & Bremhorst (1981)
    A Modified Form of the k-ε Model for Predicting Wall Turbulence
    Journal of Fluids Engineering, vol. 103, no. 3, pp. 456-460
  9. Explicit and implicit methods
    Wikipedia
  10. Diab (2019)
    Operator Splitting Methods
    Diab's blog
  11. Argos (1990)
    An investigation of oligopeptides linking domains in protein tertiary structures and possible candidates for general gene fusion
    Journal of Molecular Biology, vol. 211, no. 4, pp. 943-958
  12. Chen, Zaro & Shen (2013)
    Fusion protein linkers: Property, design and functionality
    Advanced Drug Delivery Reviews, vol. 65, no. 10, pp. 1357-1369
  13. George & Heringa (2002)
    An analysis of protein domain linkers: their classification and role in protein folding
    Protein Engineering, Design and Selection, vol. 15, no. 11, pp. 871-879
  14. Li, Huang, Zhang, Dong, Guo, Yue, Yan & Xing (2015)
    Construction of a linker library with widely controllable flexibility for fusion protein design
    Applied Microbiology and Biotechnology, vol. 100, no. 1, pp. 215-225
  15. Jumper, Evans, Pritzel, Green, Figurnov, Ronneberger, Tunyasuvunakool, Bates, Žídek, Potapenko, Bridgland, Meyer, Kohl, Ballard, Cowie, Romera-Paredes, Nikolov, Jain, Adler, Back, Petersen, Reiman, Clancy, Zielinski, Steinegger, Pacholska, Berghammer, Bodenstein, Silver, Vinyals, Senior, Kavukcuoglu, Kohli & Hassabis (2021)
    Highly accurate protein structure prediction with AlphaFold
    Nature, vol. 596, no. 7873, pp. 583-589