Model the process of RNA transcription system

In our experiment process, a very important part is to transcribe RNA molecules to facilitate subsequent research.

The transcription system is a very complex existence, involving rNTP, T7 RNAP, and the hydrogen ions and magnesium ions , so we chose to simplify the system. Here, the main substances we concerned about are four different rNTPs, the transcription product RNA (taking the self-degradation of RNA into account) and the reaction byproduct Pyrophosphoric (PPi). We will use a series of equations to calculate their concentration changes during the reaction.

According to the literature, the velocity in the transcription process is mainly determined by the concentration of compound of MgNTP. Therefore, the reaction rate is also related to the concentration of MgNTP and Mg2+. The concentration of T7 RNAP is also important. Here we have the reaction rate equation:1

kapp is the apparent transcription rate constant, and K1 and K2 are two different Michaelis-Menten equation parameters.

During the reaction, both magnesium ions and hydrogen ions will form compounds with NTP and PPi. NTP here refers to the lowest concentration of NTP in the reaction system (CTP in our reaction system). After a simplified reaction process, all the ions mentioned below constitute our equilibrium equation2. When we use their concentration value, the units will be unified to umol/L. It is worth mentioning that Tris is a buffering agent we use in the reaction. It will combine with H ions to control the pH of the reaction system at a stable value.

At the same time, the concentration change in the reaction needs to be measured by the following ODEs(ordinary differential equation)3:

In the process of solving the differential equation, we need to set the initial value, according to our experimental conditions:

In the end, we get the change of the concentration of substance we concerned in the system as shown in the figure. We also change initial NTP concentrations and T7 RNAP concentrations to obtain the concentration changes of RNA.

Figure 1 Concentration of different material in RNA transcription system

Figure 2 Variation of RNA concentration with different NTP or T7 RNAP concentration

Simple simulation for the De novo Rapid in Vitro Evolution of RNA biosensors(DRIVER) system

The process of DRIVER system modeling is more complicated. According to the data in the literature, after each round of PCR, if the ligand is added, the DNA that is not self-sheared will be amplified, and if the ligand is not added, the DNA that is self-sheared will be amplified. The amplification factor of each round of PCR is constrained by the number of reaction rounds. When ligand is added, there is a DNA molecule loss due to irresistible factors in the reaction before PCR amplification, and the loss ratio is 75%. At the beginning, we will randomly add ligand in 3-5 rounds, and other 3-5 rounds will not add the ligand, and the amplification factor will change from 4 to 8. In the subsequent process, we will randomly add ligand in 22-25 rounds, and other 25-28 rounds will not add the ligand, and the amplification factor will change from 1 to 4.

The figure below is the initial value we set and result we got after simulation. For the initial concentration of DNA with different shearing properties added to the library, we set it as shown in the left figure. Generally speaking, the middle is high and both sides are low, and there is a cliff-like decline. And after about 58 rounds DRIVER simulation, most of DNA appears has a very low concentration.

Figure 3 The result of DRIVER simulation

After DRIVER screening, the amount of the part we want (the area in the lower right corner) is already very significant. Due to the high initial content, there is still a small amount of residue in the middle, but the significance is not high.

Next Generation Sequencing(NGS) Data Analysis

1.NGS data’s source (CleaveSeq): Wet lab basic methods

The CleaveSeq protocol4 followed the same method as the cleaved selection rounds described above, up to the PCR step. At this point, the reaction was split and run through two separate PCR reactions, one that amplified the cleaved components with a “Z” prefix and the other that amplified the uncleaved components with a “W” prefix. The primers used in the above PCR reactions included 5′-overhang regions with Illumina adapters and barcodes to allow each read to be identified as to the assay conditions. In addition to the standard Illumina index barcodes embedded in the adapters, we also added 1–10 nucleotides of custom barcode nucleotides between the Illumina adapters and the prefixes or suffixes. The variable length barcodes introduce shifts of otherwise identical sequence positions in the prefix and suffix regions of the DNA being sequenced, resulting in more equal distribution of the four nucleotides at each position. This strategy improves the performance of Illumina sequencers’ clustering step, which relies on distinct sequences in adjacent clusters during the first 20 sequencing cycles. Also, at the input to the barcoding step, 15 distinct reference sequences, each with either a “W,” or “Z,” prefix, with five different lengths, and all similar to the library structure, are spiked-in at a fixed 18 pM concentration each. During the analysis, the number of reads of reference sequences provides a conversion factor for equating the number of reads with absolute concentration. The PCR reaction mixtures (1× Kapa HiFi enzyme, 1× Kapa HiFi buffer, 400 nM primers) were run for 18 cycles (under the following conditions: 98°C for 30s, 57 °C for 30 s, and 72 °C for 30 s).

The barcoded libraries were mixed in ratios based on the relative number of reads desired for each library and the libraries were diluted to 4 nM of DNA with Illumina adapters as quantified by qPCR (KAPA Library Quantification Kit). PhiX was spiked into the sequencing library at 10–20% of the total library concentration to further improve the cluster calling of the Illumina pipeline for amplicons. The libraries were sequenced on an Illumina platform, or HiSeq using 2 × 75 or 2 × 150 reads, depending on the data needs of a particular experiment, in each case using Illumina recommended loading guidelines.

2.Data analysis workflow

Analysis of NGS data was performed by a custom pipeline. The steps consisted of paired-end alignment using PEAR5, merge the result fasta files, reads-counting using the script. The references spiked-in at a known concentration prior to barcoding allow computation of absolute concentrations of each sequence with each prefix. The cleavage fraction of a sequence is computed as the ratio of the concentration of the sequence with the prefix corresponding to cleaved molecules (“W” or “Z” depending on the steps used to create the library) to the total concentration of that sequence. That is, the cleavage fraction, cs, for a particular sequence, s, is computed using:

Where r P,s are the number of reads sequence s with prefix P and r P,ref are the number of reads of the reference sequences with prefix P. The fold change of the cleavage fraction, f s under different ligand condition (e.g. +target or -target) is calculated as:

Unlike the simple ratio of the cleavage fractions, this formulation gives a fold change that should be predictive of the gene-regulatory activity ratio of the biosensor when used in vivo, where only uncleaved RNA molecules result in gene expression of the intact RNA. The factor k is used to compensate for slight variations in the experimental conditions in the two assays (i.e., −target and +target) that are unrelated to target presence. This factor is set to a value such that the median fold change over all sequences measured in the same run is 1.0 (since only a small fraction of sequences within each library are sensitive to any particular ligand).

Standard errors and confidence intervals for cleavage fraction, cs, and fold change of the cleavage fraction, fs, are dependent on the number of reads of each prefix in each condition and are computed using 1000 bootstrap samples drawn with replacement from the observed reads. Each bootstrap sample is used to compute estimates ^cs and ^fs. The 5% and 95% percentiles of the ^cs and ^fs are then used as the confidence intervals for cs and fs.

Standard error with fold change of cleavage fraction

Many methods have been devised for computing confidence intervals for the odds ratio of two proportions. Here we use the iterated method of Fleiss.

Iterated Method of Fleiss

Fleiss (1981) presents an improve confidence interval for the odds ratio. This method forms the confidence interval as all those values of the odds ratio which would not be rejected by a chi-square hypothesis test. Fleiss gives the following details about how to construct this confidence interval. To compute the lower limit, do the following.

For a trial value of ψ, compute the quantities X, Y, W, F, U, and V using the formulas

We use the middle of the confidence interval as the estimate of the value and half of the difference between the high and low as the error estimate.
(We analyze the data by the same methods as the reference article’s)

3.Pre-experimental data analysis example

We use a set of reliable data to verify our workflow. The source data are here(link)(reference article). Here we present the result of the pre-experiment.

Figure 4 NGS data analysis using article's data

Cleavage fractions in the presence and absence of the T1b ligand mixture for a DRIVER enriched library at round 74 (N = 4328; at least 30 reads/sequence in each condition); red dots indicate sequences with strong (>3×), significant switching;


Our wet lab experiment provided eight NGS samples. Then we analyzing the samples by the above workflow. Some key information is listed below.

After going through the analysis process, in the round 56 of sequencing results, we selected 18 sequences that met the conditions through data analysis. For these sequences, we also excluded reference amplification based on some biological theoretical analysis. In the case of reverse transcriptase mutations and other circumstances, the desired result will eventually be obtained.

Figure 5 NGS data analysis results

The figure above are the results of round 32, the figure below are the results of round 56. They are the technology replica; Red dots indicate sequences with strong (>3×),significant switching; Red dashed line is the Regression estimation.

RNA Aptamer Probe Design


Our purpose is to design a probe sequence that connects the 5' end and the 3' end for the Aptamer screened out by the NGS sequencing data analysis pipeline.

The combination of the designed probe sequence and Aptamer is the template synthesized by our subsequent experiments.


Figure 6 Workflow of RNA aptamer probe design

First, we will use random sampling or full sampling as appropriate to explore the sequence space of RNA probes (each aptamer has a corresponding probe sequence space).

Then, we will perform RNA secondary structure prediction for the aptamers with probes, and use the structure prediction information to evaluate and screen the probes, and finally select a suitable probe for each aptamer.

2.1 Sampling for probe sequence space.

2.1.1 Methods of sampling

The experimental requirements proposed in our wet lab require the modeling part to design a probe with a total length of about 40 nt for aptamer, so the degree of freedom of the probe sequence space at this time is 440. The freedom is too large to traverse all the sequences at this time. Therefore, in view of this situation, we need an efficient sampling method to reduce time cost. We thought of exhaustive method, sequence evolution, MCMC sampling and other methods and finally chose random sampling (that is, randomly generated RNA sequence as a probe). At the same time, we have also implemented the full sampling function, so that it can be used when the space degree of freedom of the probe sequence is not so large.

2.1.2 Filter

In order to reduce the burden of the model in the structure prediction part, we combined several basic principles of probe design to design a filter. In general, it can filter out 10-20% of the sequence obtained by random sampling or full sampling, to a certain extent It saves computing resources:

  • ◆ CG content: between 40%-60%
  • ◆ Repetition limit:
    • ◇ Single base repetition: 4 consecutive G and 5 consecutive A/C/U cannot appear.
    • ◇ Double base repeat: 5 consecutive dinucleotides cannot appear

2.2 RNA secondary structure prediction for aptamers with probes.

2.2.1 Software

We use existing software to predict RNA secondary structure. We deployed the Vienna RNA Package, a basic RNA secondary structure analysis software6. The software can output the structure prediction results as dot-bracket notation (.dbn file).

2.2.2 Verify software effect

We use the tRNA structure data in tRNAdb (Transfer RNA database) to test the effect of this software. A total of 601 tRNAs participated in the test, and the accuracy of secondary structure prediction was above 80%.

2.3 Evaluate and screen out probes based structure prediction results.

2.3.1 Evaluation and screening principles

Probe part: ensure that the part of the probe sequence is "free", that is, it does not complement the other parts of the sequence, and it is all "." in the dot brackets expression.

Aptamer part: to ensure the normal function of the nucleic acid aptamer, that is, the active center of the aptamer has a fixed base complementary pairing, which is expressed in a specific dot brackets expression.

2.3.2 Output file format

Currently, our workflow can output the screening results as files in different formats such as .fst and .dbn


Contact us

BioX-Institutes, Shanghai Jiao Tong University, Dongchuan Rd. 800

Copyrights© 2021 SJTU-BioX-Shanghai