‘Data is the new oil’ is probably one of the most popular catchlines in the modern era, highlighting the significance and value of data.
By 2025, it is estimated that there will be 175 zettabytes of data in the global datasphere. A single zettabyte is equivalent to one sextillion bytes…or one trillion gigabytes…or 1,000,000,000,000,000,000,000 bytes! That’s an awful lot of zeros! To give you a more practical comparison, with just one zettabyte of data, you could store the entire series of Game of Thrones over 11 billion times, and with all 175 zettabytes, nearly two trillion times! (assuming an episode in HD is roughly 1.22 gigabytes * 73 total episodes).
Our current magnetic or optical data storage systems, which hold this volume of data in 0s and 1s, could last for at most a century. Storing this colossal amount of data is also extremely costly – exabyte-scale (1 billion gigabytes) data centres the size of several football fields cost $1 billion to build and maintain. So as you can see, data storage could soon become a serious problem.
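If you want to check the back-of-the-envelope maths yourself, here is a minimal sketch in Python. It assumes decimal units (1 gigabyte = 10^9 bytes, 1 zettabyte = 10^21 bytes) and takes the episode size and episode count from the comparison above; the function name copies_of_series is just an illustrative choice.

```python
GB = 10**9   # bytes in one gigabyte (decimal units, an assumption)
ZB = 10**21  # bytes in one zettabyte (decimal units, an assumption)

EPISODE_GB = 1.22  # rough size of one HD episode, as quoted in the text
EPISODES = 73      # total number of episodes, as quoted in the text

# Size of the whole series in gigabytes (roughly 89 GB).
series_gb = EPISODE_GB * EPISODES


def copies_of_series(zettabytes):
    """Return how many full copies of the series fit in the given number of zettabytes."""
    total_gb = zettabytes * ZB // GB  # 1 ZB is one trillion GB
    return total_gb / series_gb


print(f"Copies in 1 ZB:   {copies_of_series(1):,.0f}")
print(f"Copies in 175 ZB: {copies_of_series(175):,.0f}")
```

Under these assumptions, one zettabyte holds a little over 11 billion copies of the series, and 175 zettabytes holds close to two trillion.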
But somewhere…over…the thymine… researchers believe that an alternative solution lies in the blueprint of life: DNA.
And as a proof-of-concept, they’ve stored a copy of The Wizard of Oz – translated into Esperanto – using a technique that harnesses the information-storage capacity of intertwined DNA strands consisting of the four nucleotides A (adenine), T (thymine), G (guanine) and C (cytosine). They were able to successfully encode and retrieve information in a way that is both compact and durable.
OMG, how is that possible?
Computer code to genetic code.
The storage capacity of the genetic material is mind-boggling – it can stow massive amounts of data at a density far surpassing that of our run-of-the-mill electronic devices. E. coli, the common bacterium, can store about 1.25 exabytes. Data from the US Library of Congress can be stored in a DNA archive the size of a poppy seed – 6,000 times over. Half of that poppy seed can store all of Facebook’s data. All of the data ever generated by humanity can be stored in a sphere no bigger than a ping-pong ball.
Cells in our bodies have a lot in common with computers. Computers encode information in series of 0s and 1s, which, when read, execute programs. Organic cells store information in the A, T, G and C nucleotide bases of their DNA, which, when read, are used to produce proteins. Strung together, these bases make the biological code for life.
Digital information can also be stored in and retrieved from DNA: the code is written by DNA synthesis and read by DNA sequencing. For instance, A and T could be used to represent 0 and 1, respectively. DNA molecules are synthesised letter by letter via enzymatic reactions and indexed. The written segments of DNA are then stored in solution or dried. To retrieve the stored information, the targeted segment of the DNA strand is read by sequencing equipment – much as bioinformaticians carry out genome sequencing – and translated back into the original digital file with decoding algorithms.
DNA is already routinely sequenced, synthesised and accurately copied by bioinformaticians. What makes it so promising, though, is its ability to archive a staggering amount of information in a highly efficient and stable manner.
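To make the idea concrete, here is a toy sketch in Python of the A-for-0, T-for-1 mapping mentioned above. It is purely illustrative and not how any particular research group encodes data: the function names encode and decode are made up for this example, and real schemes use all four bases along with indexing and error correction.

```python
# Toy mapping from the text: A represents 0 and T represents 1.
BIT_TO_BASE = {"0": "A", "1": "T"}
BASE_TO_BIT = {base: bit for bit, base in BIT_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, one base per bit."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BIT_TO_BASE[bit] for bit in bits)


def decode(strand: str) -> bytes:
    """Turn a string of A/T bases back into the original bytes."""
    bits = "".join(BASE_TO_BIT[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Oz")
print(strand)          # ATAATTTTATTTTATA
print(decode(strand))  # b'Oz'
```

With one base per bit, each byte takes eight bases. Since DNA has four letters, a mapping such as two bits per base would halve the strand length, which hints at where the density comes from.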
Lost in translation.
One of the major bottlenecks in DNA data storage, other than its cost (currently it takes $1 trillion to write one petabyte of data), is the complexity involved in retrieving your targeted file from the sea of data files. The end result is often lots of pesky errors.
Writing and reading DNA introduces roughly one mistake per 1,000 nucleotides, in the form of substitutions, insertions and deletions. At the same time, picking out the data file you want is akin to searching for a needle in a haystack.
Scientists are now looking for innovative methods to replace conventional techniques, such as the polymerase chain reaction (PCR), to retrieve DNA files more systematically. Some researchers have developed procedures to encapsulate each DNA file in small silica particles. The particles are then labelled with “barcodes” that correspond to the contents of the file, making them easier to identify during data retrieval.
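To get a feel for what those three kinds of mistakes do to a strand, here is a small simulation sketch in Python. It assumes each position independently goes wrong with the quoted rate of one in 1,000 and that the three error types are equally likely; both are simplifications for illustration, and the function name add_noise is hypothetical.

```python
import random

BASES = "ATGC"


def add_noise(strand, error_rate=0.001, rng=None):
    """Return a copy of strand with random substitutions, insertions and deletions."""
    if rng is None:
        rng = random.Random()
    out = []
    for base in strand:
        if rng.random() >= error_rate:
            out.append(base)  # base is copied correctly
            continue
        kind = rng.choice(["substitution", "insertion", "deletion"])
        if kind == "substitution":
            # Swap the base for one of the other three.
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind == "insertion":
            # Keep the base and add an extra random one after it.
            out.append(base)
            out.append(rng.choice(BASES))
        # For a deletion, the base is simply left out.
    return "".join(out)


original = "ATGC" * 2500  # a 10,000-base strand
noisy = add_noise(original, rng=random.Random(42))
print(len(original), len(noisy))
```

At this rate a 10,000-base strand picks up about ten errors on average. Insertions and deletions are the awkward ones, because they shift every base that follows and so throw off the decoding of everything downstream.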
While DNA-based information storage is still limited to niche applications, plenty of research is currently underway to fully transform DNA into data powerhouses that can satisfy our voracious appetite for data, paving a yellow brick road to the data storage solutions of the future.