The volume of data produced globally each year is growing exponentially, outpacing the capacity of existing storage media such as tape and disk. DNA storage, the representation of arbitrary information as sequences of nucleotides, offers a promising alternative medium. DNA is nature's information-storage molecule of choice and has several key properties: it is extremely dense, with a theoretical capacity of 455 EB/g; it is durable, with a half-life of approximately 520 years that can be extended to thousands of years when the DNA is chilled and stored dry; and it is amenable to automated synthesis and sequencing. Furthermore, biochemical processes that act on DNA could enable highly parallel data manipulation. Whereas biological information is encoded in DNA through a specific mapping from nucleotide triplets to amino acids, DNA storage is not limited to a single encoding scheme: there are many possible ways to map data to nucleotide sequences for synthesis, storage, retrieval and data manipulation. To be viable, however, an encoding scheme must address several biological, error-tolerance and information-retrieval considerations. This comprehensive review compares existing work on encoding arbitrary data in DNA in terms of the encoding schemes used, the methods for addressing biological constraints and the measures for providing error correction. We compare encoding approaches by the overall information density and coverage they achieve, as well as by the data-retrieval method they use (i.e., sequential or random access). We also discuss the background and evolution of the encoding schemes.
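To make the idea of mapping data to nucleotides concrete, the following is a minimal sketch of one naive scheme: two bits per nucleotide, with no error correction. It is not taken from any of the reviewed works. The bit-to-base assignment and the names `BASES`, `encode`, `decode` and `longest_homopolymer` are assumptions chosen for illustration. The homopolymer check reflects a commonly cited biological constraint, namely that long runs of a single base are error-prone to synthesize and sequence.

```python
# Hypothetical illustration only: a naive 2-bits-per-nucleotide mapping.
# Assumed assignment: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Map each byte to four nucleotides, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Invert encode(); expects only A/C/G/T and a length divisible by 4."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # str.index raises ValueError on any character outside BASES.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def longest_homopolymer(strand: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty strand)."""
    longest = run = 0
    previous = ""
    for base in strand:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)                                # CAGACGGC
    print(decode(strand))                        # b'Hi'
    print(longest_homopolymer(encode(b"\x00")))  # 4: a zero byte becomes AAAA
```

The naive mapping reaches the maximum of two bits per nucleotide, but a single zero byte already yields a four-base homopolymer. This illustrates why practical schemes typically trade some density for constraint handling and error correction.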