The biochemical processes underlying DNA data storage, including synthesis, amplification, and sequencing, are inherently noisy. Consequently, base-level insertion, deletion, and substitution (IDS) errors, as well as sequence-level dropouts, occur and pose major challenges for reliable data retrieval. Here we introduce DNA-MGC+, a DNA storage codec designed to enable reliable and resource-efficient data retrieval under diverse operating conditions. We evaluate DNA-MGC+ across a wide range of in silico and in vitro settings, including experiments with both Illumina and Nanopore sequencing, and show that it consistently outperforms existing codecs. In particular, DNA-MGC+ achieves simultaneous gains in sequencing depth requirements, read cost, decoding time, storage density, and error-correction capability under explicit reliability constraints. Notable results include reliable decoding under IDS error rates of up to 24% in synthetic scenarios, and reliable retrieval at sequencing depths below 3× with read costs below 3.5 nt/bit under electrochemical synthesis for both Illumina and Nanopore sequencing.