Purpose of Our Software
The goal of Storagene is to store data on DNA. To do this, we first need to encode data in form of a base sequence. Even more importantly, we need to be able to decode a sequenced DNA strand to recover the original data. These two tasks are performed by our software dna_utils. The main focus lies in error correction in the decoding process. Many errors can occur during DNA synthesis and sequencing. By utilizing the peculiarities of our semi-specific synthesis in combination with the capabilities of nanopore sequencing, we will be able to robustly decode data from DNA. This plays an important role in making this technique easier to use and thus accessible to everyone. Our software addresses a problem that is not very common today. However, it has the potential to become a standard bioinformatics tool when DNA data storage is eventually technically integrated into everyday life.
Overview
Our software is written in Python. Right now, it consists of a program library and a command-line tool that utilizes the library. The command-line tool makes all the basic functionalities of our library available, namely encoding and decoding data. In addition, it would be possible to implement a graphical user interface, which would make the usage even easier for everyone.
In the following, the two functions of our software encoding and decoding are explained in detail. It will become clear how we can store data on DNA, although our synthesis approach is semi-specific. We will also discuss the challenges that arise when decoding DNA strands.
Encoding Data into DNA
Figure 1: Overview of the steps needed for encoding data into DNA.
Changes of Nucleotides as Information Carriers
Our DNA synthesis is semi-specific, meaning that we cannot control how many nucleotides are attached in each step. This means that we will always have chunks of an unknown length of each nucleotide, e.g. "AAACCCCCAAAAAGGGG". These chunks are also called homonucleotides. Nevertheless, to be able to encode data in this, we are using a rather unusual encoding scheme. Instead of encoding one bit of information to each base pair, we are encoding information in each change of bases. For example, the change from a homonucleotide of As to a homonucleotide of Cs could encode a 1
.
Using a Ternary Code
From each homonucleotide, three other homonucleotides are reachable. From "A" we can go to "C", "G", and "T". From "C" we can go to "A", "G", and "T", and so on. So, to be able to use all possible combinations, we do not need a binary code, but a ternary code. Computers work with a binary code, meaning that every information is encoded in bits. Each bit can either be 0
or 1
. Instead, in our ternary code information is encoded in trits. Each trit can have the states 0
, 1
, or 2
. We adapted both the idea of encoding information in base changes and the use of a ternary system from Church et al. 1.
Converting from Binary to Ternary
Our software can encode arbitrary files into DNA strands. Therefore, the first step is to convert binary data into ternary data. Computers work with bytes, so chunks of eight bits. Hence, it is natural to first split the string of bits we get from a file into bytes. Then, the software converts each byte into trits.
The count of values a number can take is calculated by the formula $b^l$. In this $b$ is the base of the number system, e.g., $2$ for the binary system, and the superscript 1 is the number of digits. Since a byte consists of eight bits with two states each, every byte can have $2^8=256$ states. To represent the information of one byte with trits, we need enough trits to represent at least $256$ different states. Five trits can represent $243$ states, since $3^5=243$. So obviously, five trits are not enough to encode one byte. Six trits, in contrast, represent $3^6=729$ states. So one byte can be encoded with six trits. In the following, we will call a sequence of six trits a tryte. However, this also shows that converting one byte into one tryte is extremely inefficient. $\frac{256}{729} \approx 35.12$%, so roughly one-third of the actual capacity of this code is used.
Converting Number Systems More Efficiently
Instead of converting eight bits into six trits, it is also possible to cut up the original data differently. In the best case, this can lead to a more efficient conversion. For example, three bits could be converted into two trits. Since $2^3=8$ and $3^2=9$, $\frac{8}{9}$ or 89% of the maximum capacity is used. And even more efficient conversion would be from eleven bits to seven trits with a capacity of 94%. However, both of these conversions would also require some additional computation. Arbitrary bit strings usually cannot be divided into chunks of three or eleven bits. Therefore, the last chunk would need to be padded with bits. This means that the encoded data also needs to contain the information of how many bits were padded to the end. Due to this additional complexity, other conversions than from byte to tryte are not implemented in our software right now.
Decoding and Error Correction in Sequencing Data
Figure 2: Overview of the steps for retrieving information from DNA.
As you can see in figure 2, decoding mainly works just as encoding but backward. However, one additional step is necessary before data can be decoded. From the sequenced data, we first must retrieve the original DNA sequence that was used for synthesis. To do so, we must understand what happens in synthesis. As discussed before, an unknown number of nucleotides is attached in each step. This is due to the semi-specific approach. In addition to that, various primers and other sequences are added to both ends of the encoded sequence. This partly happends during synthesis and partly in preparation for sequencing.
Figure 3: Example of DNA sequences in a fastq file.
The decoding starts with a fastq
file containing sequencing data. Figure 3 shows an example of such a file. Each read from sequencing is expressed by four lines. While the first and third lines can contain an identifier and description of the sequence, the second and fourth lines are more important to us. The second line contains the actual DNA sequence. In line 4 the quality of the sequence is encoded. This is important to assess how reliable the sequencing was. The quality is encoded in values reaching from 0x21 (33 in decimal) to 0x7E (126) and displayed as the corresponding ASCII character. So "!" marks the lowest quality, while "~" stands for the highest quality.
Removing Primers and Cleaning Up Data
The first step necessary for decoding is the removal of the primer since this does not contain any relevant information. To do so, the software scans every sequence for a subsequence that is similar to the primer. If it finds the primer in a sequence, it deletes the primer and everything prior to it from the sequence. If it does not find a primer, it deletes the whole sequence from the dataset. As shown in figure 4, the sequence encoding information begins right behind the primer. So by removing the primer and everything before it, we know exactly that the sequence now begins with the encoding sequence. After that, this step is repeated with the reverse primer. The reverse primer starts directly behind the encoding sequence. After removing it and everything behind it, all sequences now only contain the relevant information.
Figure 4: Overview of the components of a sequence.
Before the primer, there is an additional sequence added. This is necessary for nanopore sequencing. By clearing the sequences in the way described before, these are also removed automatically. If there is any sequence not containing one of the primers, it will be removed completely. This ensures that no corrupted data disturb the further steps.
Finding Primer Sequences with Errors
When searching for primers in the sequences, not only perfect matches should be found. In synthesis and sequencing, errors can happen in various places. So all subsequences that are very similar to a primer should be classified as such. For that classification, the Levenshtein distance is used. This is a very popular measure for all kinds of string comparisons. It describes how many insertions, deletions, or substitutions of symbols are needed to match two strings. For example, the sequences "ACGT" and "ACGA" have a Levenshtein distance of 1, since one substitution is needed to match them. In the same manner, the sequences "ACGT" and "ACGTA" also have a distance of 1, since they can be matched with one insertion in the first sequence.
Merging the Sequences
After removing additional parts from the sequences, only the relevant parts should be left. Now the obvious next step would be to take a sequence, merge all homonucleotides into single nucleotides and go on with decoding. However, many potential sources of error make it unlikely that every sequence contains the full and correct information. To compensate for that, our software compares all remaining sequences to find the most likely original sequence. First, every sequence gets transformed into a sequence of homonucleotides internally. Each homonucleotide holds information about the nucleotide of which it is composed, but also its length and average quality. This is visualized in figure 5. Then, the first homonucleotides of all sequences are compared to find the most likely one. For that, the lengths and average qualities are used to find the most likely nucleotide. This step is repeated multiple times to recreate the whole sequence. Usually, the sequences do not all have the same lengths. Many sequences only contain a few homonucleotides, while some are much longer.
Figure 5: Overview of the representation of a sequence as homonucleotides.
To decide when to stop comparing the remaining sequences, a percentile is used. For example, the percentile can be 50%. That means that the sequences are compared as long as at least 50% of all sequences reach that lengths. If 50 out of 100 sequences contain six homonucleotides and all others are shorter, for example, the first six homonucleotides are compared.
After that step, the software saves the recreated sequence in a fasta
file. This is similar to the fastq file mentioned before but does not contain a quality measure. The sequence in that file can then be used to decode the original information again. As said before, this works in the same manner as encoding but backward.
Application as Demonstration for Integrated Human Practices
As described on our Human Practices page, we reached out to the city archive in Aachen to discuss potential use cases of our proposed storage solution. The city archive has digitalized many historical documents like newspapers. As a technology demonstration, we encoded a scan of a newspaper article from 1933.
Figure 6: Demonstration of our encoding software.
Figure 6 shows a clipping of the newspaper page as well as an extract from the encoded sequence. With the link below, you can download the full sequence as a fasta file.
Download
You can find our software on GitHub