DNA is considered a promising medium for storing digital information. As an essential step in the DNA-based data storage workflow, coding algorithms are responsible for implementing functions such as bit-to-base transcoding and error correction. In previous studies, these functions were normally realized by combining multiple algorithms. Here, we report a graph-based architecture, named SPIDER-WEB, that provides an all-in-one coding solution by generating customized algorithms automatically. SPIDER-WEB can correct up to 4% edit errors in DNA sequences, including substitutions and insertions/deletions (indels), with only 5.5% redundant symbols. Since the correcting and decoding processes require no pretreatment of the DNA sequences, SPIDER-WEB supports real-time information retrieval, running 305.08 times faster than single-molecule sequencing techniques. Our retrieval process is two orders of magnitude faster than the conventional one on megabyte-level data and can scale to exabyte-level data. Therefore, SPIDER-WEB holds the potential to improve the practicability of large-scale DNA-based data storage applications.
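To make the idea of graph-based bit-to-base transcoding concrete, the following is a minimal Python sketch. It is not the SPIDER-WEB implementation: the homopolymer constraint, the k-mer length, the one-bit-per-base mapping, the start vertex, and every function name (`is_valid`, `build_graph`, `encode`, `decode`) are assumptions chosen for illustration. Vertices are k-mers that satisfy a constraint, edges append one base, and each bit selects one of the first two outgoing edges of the current vertex. A received base that does not follow an allowed edge is reported as an error; this toy version only detects such errors and does not correct them.

```python
from itertools import product

BASES = "ACGT"


def is_valid(seq, max_run=2):
    """Toy biochemical constraint (an assumption): no homopolymer run longer than max_run."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return False
    return True


def build_graph(k=4, max_run=2):
    """Map every valid k-mer to the valid k-mers reachable by appending one base."""
    vertices = {"".join(p) for p in product(BASES, repeat=k)}
    vertices = {v for v in vertices if is_valid(v, max_run)}
    graph = {
        v: [v[1:] + b for b in BASES if v[1:] + b in vertices]
        for v in vertices
    }
    # Repeatedly drop vertices with fewer than two successors,
    # so that every remaining vertex can carry one bit.
    changed = True
    while changed:
        changed = False
        for v in list(graph):
            graph[v] = [w for w in graph[v] if w in graph]
            if len(graph[v]) < 2:
                del graph[v]
                changed = True
    return graph


def encode(bits, graph, start):
    """Walk the graph: each bit picks successor 0 or 1; emit the appended base."""
    vertex, out = start, []
    for bit in bits:
        vertex = graph[vertex][bit]
        out.append(vertex[-1])
    return "".join(out)


def decode(dna, graph, start):
    """Replay the walk; a base that leaves the allowed edges signals an error."""
    vertex, bits = start, []
    for position, base in enumerate(dna):
        nxt = vertex[1:] + base
        choices = graph[vertex][:2]
        if nxt not in choices:
            raise ValueError(f"invalid transition at position {position}")
        bits.append(choices.index(nxt))
        vertex = nxt
    return bits


graph = build_graph()
payload = [0, 1, 1, 0, 1, 0, 0, 1]
strand = encode(payload, graph, start="ACGT")
recovered = decode(strand, graph, start="ACGT")
print(strand, recovered == payload)
```

Because the decoder replays the same walk one base at a time, it needs no preprocessing of the read, and only a few of the transitions are legal at each vertex, which is what lets a walk-based decoder notice many edit errors. Carrying one bit per base is a deliberate simplification here; a practical scheme would use the full out-degree of each vertex to raise the information density.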