Sequencing analysis
The analysis of the sequencing data is the last step of DRIVER. The analysis identifies possible biosensors in the libraries, which can subsequently be tested in the lab. Here, we describe how we build our sequence analysis pipeline.
Introduction
The analysis of the sequencing data is the last step of DRIVER. The analysis identifies possible biosensors in the oligonucleotide libraries. The found sequences can subsequently be tested in the lab to verify whether they show affinity for the target ligand. We developed our own analysis pipeline based on the work of Townshend et al. and the CleaveSeq protocol [1, 2]. The analysis was implemented in the Python programming language, as it is open-source and thus contributes to the values of iGEM. To keep the analysis pipeline well structured, we used the open-source workflow management system Snakemake [3]. The source code can be studied here.
Sequence structure
DRIVER sequence structure
The structure of a sequence in the original DRIVER protocol is as follows:
variable barcode - prefix - potential aptazyme - suffix - variable barcode
- Variable barcode: a variable-length fragment of 1 to 10 nucleotides that is added to improve the Illumina sequencing performance. It does not carry any additional information needed for the analysis. Hence, this will not be used for downstream steps.
- Prefix: indicates whether an aptazyme sequence was cleaved or uncleaved during the previous round.
- Potential aptazyme: potential aptazyme sequence, the hammerhead ribozyme stem with random loops.
- Suffix: a sequence that is needed for the amplification step in DRIVER and for the Next Generation Sequencing (NGS) adapter ligation.
- Variable barcode: another variable barcode with the same characteristics as the first one.
Adapted sequence structure
We adapted the original sequencing protocol. Hence, our sequence-structure differs from the original sequence structure. To demultiplex the sequencing data, we added known barcodes in front of the prefix and after the suffix sequences. The prefix indicates whether a sequence belonged to the condition ligand not present (-ligand) or the condition ligand present (+ligand) during the previous round. The suffix indicates what library a sequence belongs to. In our sequencing protocol, the different libraries were sequenced together. In the first library, potential aptazymes were selected for folate and in the second library, potential aptazymes were selected for a mixture of vitamins. To lower sequencing costs, the folate library and the mixture library were sequenced together. Our sequences follow the structure:
barcode prefix - prefix - potential aptazyme - suffix - barcode suffix.
- Barcode prefix: indicates whether a sequence belongs to the condition -ligand or +ligand. This barcode was used during the demultiplexing of the data.
- Prefix: indicates whether an aptazyme sequence was cleaved or uncleaved during the previous round.
- Potential aptazyme: potential aptazyme sequence, the hammerhead ribozyme stem with random loops.
- Suffix: a sequence that is needed for the amplification step in DRIVER and for the NGS adapter ligation.
- Barcode suffix: indicates whether a sequence comes from the folate library or the vitamins library. This barcode was used to demultiplex the data.
Data analysis
The steps of the analysis are as follows [1]:
- Paired-end alignment: The forward and reverse read sequences are paired-end aligned to increase the overall read length. The Paired-End reAd mergeR (PEAR) was used for the alignment since it allows for merging DNA fragments of varying lengths [4]. PEAR already performs quality checks during the merging. We used the default values of PEAR for our analysis. The resulting assembled reads file consists of individual reads.
- Grouping identical reads: Identical reads of the assembled reads file are grouped together and sorted based on the number of read counts. A table is formed with the number of reads observed of every sequence.
- File splitting: The resulting file contains sequences from the conditions -ligand and +ligand. The sequences are split into three files. The sequences with the barcode prefix -ligand are stored in the first file, the sequences with the barcode prefix +ligand are stored in the second file, and the sequences with neither barcode are stored in the third file.
Database insertion: All sequences that have a minimum of 25 reads are inserted into an SQLite database. The database insertion consists of the following intermediate steps:
- The prefix position and type are determined. The position of the prefix can vary due to deletions or substitutions in the barcode prefix. To find the position, we apply approximate pattern matching using regex as we saw that allowing one error in the prefix improved our analysis [5]. Afterwards, we checked whether the type of prefix found corresponds to cleavage in the previous round or not.
- The suffix position and type are determined in the same way as the prefix.
- The potential aptazyme sequence is the sequence between the prefix and the suffix. Therefore, the position of the potential aptazyme sequence is easily retrieved. We check whether the sequence is a reference sequence and store this information. The reference sequences were added at known concentrations prior to barcoding. This enables us to compute the absolute concentration of every sequence.
- A unique accession number is assigned to every unique potential aptazyme sequence. This improves the speed of downstream steps because it allows for easy and fast retrieval of information of a specific sequence.
- Combine sequences (DRIVER structure-specific): This step is not necessary for our own sequence structure. But, we did not remove it from the pipeline because inclusion of this step makes the pipeline more modular. If the original DRIVER protocol is followed, it is likely that sequences have the same prefix - aptamer - suffix sequence, but different barcodes. These have to be combined as the barcode is only used for sequencing improvement.
Calculate cleavage fraction: The cleavage fraction (cs) of the conditions -ligand and +ligand are calculated for every sequence. The cleavage fraction for a condition can only be computed if a sequence has read counts with a prefix indicating cleavage and read counts with a prefix indicating non-cleavage. The cs of a sequence is computed as the ratio of the concentration of the cleaved sequences to the total concentration of that sequence. The reference sequences are used to allow the calculations to be computed with absolute concentrations. Hence, the cleavage fraction for a sequence s is calculated as follows:
where r stands for read count, clvd for cleaved prefix, unclvd for uncleaved prefix, and ref for reference sequence. The higher the computed cleavage fraction, the higher the number of potential aptazymes that were cleaved.
Calculate fold change: The fold change (ƒs) of the cleavage fraction between -ligand and +ligand for a sequence s can be computed as follows:
The factor k is used to compensate for slight variations in the experimental conditions. This factor is set to a value such that the median value of all the sequences is equal to 1.0. This is required since only a small fraction of the sequences within each library is sensitive to any particular ligand. The fold change can only be calculated for sequences that have a cleavage fraction for both conditions.
- Bootstrapping: To obtain the standard error and confidence interval of the cleavage fraction and the fold change, 1000 bootstrap samples are drawn. The bootstrap samples are drawn with replacement from all the observed reads, including the reference sequences. The bootstrap sample size is equal to the total number of read counts. The cleavage fraction and fold change are calculated for every bootstrap sample. The values for cs and ƒs for each sequence are averaged to compute the ĉs and the ƒ̂s, respectively. The standard error of the mean and the resulting 95%-confidence interval are also calculated for ĉs and ƒ̂s.
Biosensor hit assay: The p-value for a sequence that is not a biosensor is calculated. To do this, the Z-score is computed as follows:
where ƒ̂s is the fold change for a sequence s, the expected fold change when it is not a biosensor is 1. Finally, the s is the standard error. From this Z-score, a p-value is calculated. Sequences that have a fold change > 2.0 and a p-value < 1/N, where N is the number of sequences measured, are flagged as possible biosensors. This information is stored in the database.
- Plot results: The results from the analysis are plotted in two figures. The standard error against the fold change is plotted for every sequence. The cleavage fraction for -ligand is plotted against the cleavage fraction for +ligand. Flagged potential aptazyme sequences are shown in the terminal. These sequences can be tested in the lab for verification.
Discussion
We developed our own analysis pipeline to analyze the sequencing data from the laboratory based on the work of Townshend et al. [1]. With the help of our analysis pipeline, we found possible biosensors and relayed this information back to the wet lab team. The results of our analysis and our conclusions can be found on the Results page. Although we got results with the help of our pipeline, in August 2021 the author’s original code was published online. Their analysis was quite different from ours and they performed some steps not mentioned in their paper. They allowed for errors within a sequence and performed a lot more extra checks [6]. Unfortunately, we did not have time to implement their ideas in our analysis, since we would have to make some significant changes to our source code. Hence, we would advise future iGEM teams performing DRIVER to take a look at the original code used for the analysis and adapt it to their needs if required. However, our analysis pipeline makes use of the main DRIVER analysis principles, but is more constrained and thus more susceptible to false-negative results (type II errors).
References
- Townshend, B., Xiang, J., Manzanarez, G., Hayden, E., & Smolke, C. (2021). A multiplexed, automated evolution pipeline enables scalable discovery and characterization of biosensors. Nature Communications, 12(1) https://doi.org/10.1038/s41467-021-21716-0
- Townshend, B. (2021). CleaveSeq: Scalable characterization of ribozyme-based RNA biosensors. https://doi.org/10.21203/rs.3.pex-1346/v1
- Mölder, F., Jablonski, K., Letcher, B., Hall, M., Tomkins-Tinch, C., & Sochat, V. et al. (2021). Sustainable data analysis with Snakemake. F1000research, 10, 33. https://doi.org/10.12688/f1000research.29032.1
- Zhang, J., Kobert, K., Flouri, T., & Stamatakis, A. (2013). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics, 30(5), 614-620. https://doi.org/10.1093/bioinformatics/btt593
- Regex (2021). PyPI. (2021). Adopted from https://pypi.org/project/regex/ on 12/10/2021.
- GitHub - btownshend/DRIVER-Analysis: Code associated with 2020 DRIVER paper. GitHub. (2021). Retrieved from https://github.com/btownshend/DRIVER-Analysis on 13/10/2021.