The surge in demand for cost-effective, durable long-term archival media, coupled with the density limitations of contemporary magnetic media, has led to synthetic DNA emerging as a promising alternative. Despite its benefits, storing data on DNA poses several challenges, as the technologies used for reading and writing data and for achieving random access on DNA are highly error-prone. To deal with such errors, it is important to design efficient pipelines that use redundancy carefully to mask errors without inflating overall cost. In this work, we present the Columnar MOlecular Storage System (CMOSS), a novel, end-to-end DNA storage pipeline that provides error-tolerant data storage at low read/write cost. CMOSS differs from the state of the art (SOTA) on three fronts: (i) a motif-based, vertical layout, in contrast to the nucleotide-based, horizontal layout used by SOTA; (ii) merged consensus calling and decoding, enabled by the vertical layout; and (iii) a flexible, fixed-size, block-based data organization for random access over DNA storage, in contrast to the variable-sized, object-based access used by SOTA. Through an in-depth evaluation using simulation studies and real wet-lab experiments, we demonstrate the benefits of the various CMOSS design choices. We make the entire pipeline, together with the read datasets, openly available to the community for faithful reproduction and further research.