DNA-based data storage is emerging as a candidate solution for archiving digital data. To further increase the potential capacity of DNA-based storage systems, the combinatorial composite DNA synthesis method was recently suggested. This approach extends the DNA alphabet by using short DNA fragment reagents, known as shortmers, as building blocks: each alphabet symbol consists of a fixed number of shortmers. Consequently, when information is read, one of the shortmers that make up a symbol may be missing, in which case the symbol cannot be determined. In this paper, we model this event as a type of asymmetric error and propose code constructions that correct such errors in this setup. We also provide a lower bound on the redundancy of such error-correcting codes and give an explicit encoder and decoder pair for our construction. The proposed error model is supported by an analysis of data from actual experiments that produced DNA according to the combinatorial scheme. Finally, we provide a statistical evaluation of the probability of observing such error events as a function of read depth.
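To make the error event concrete, the following is a minimal Python sketch, not the paper's construction or its statistical analysis. It rests on simplifying assumptions that are ours: a symbol is a set of `k` shortmers drawn from `n` available ones (both names are hypothetical), each read independently reports one of the symbol's `k` shortmers uniformly at random, and a shortmer can go missing but never appears spuriously. Under these assumptions, the probability that at least one shortmer is never observed follows from inclusion-exclusion, and an incomplete observation leaves several candidate symbols.

```python
from itertools import combinations
from math import comb


def missing_shortmer_probability(k: int, reads: int) -> float:
    """Probability that at least one of a symbol's k shortmers is never seen.

    Assumption (ours): each of the `reads` reads independently reports one
    of the k shortmers uniformly at random, with no other error source.
    """
    # Inclusion-exclusion: probability that all k shortmers are observed.
    all_seen = sum(
        (-1) ** j * comb(k, j) * (1 - j / k) ** reads for j in range(k + 1)
    )
    return 1.0 - all_seen


def consistent_symbols(observed: frozenset, n: int, k: int) -> list:
    """All k-subsets of {0, ..., n-1} that contain the observed shortmers.

    Models the asymmetric nature of the error: shortmers may be missing
    from the observation, but observed shortmers are assumed genuine.
    """
    rest = [s for s in range(n) if s not in observed]
    missing = k - len(observed)
    return [observed | frozenset(extra) for extra in combinations(rest, missing)]


if __name__ == "__main__":
    # Error probability shrinks as read depth grows (k = 5 is illustrative).
    for depth in (5, 10, 20, 40):
        print(depth, missing_shortmer_probability(5, depth))
    # One shortmer missing out of k = 3, with n = 4 available shortmers:
    print(consistent_symbols(frozenset({0, 1}), n=4, k=3))
```

When all `k` shortmers are observed, `consistent_symbols` returns a single candidate; each missing shortmer enlarges the candidate list, which is the ambiguity an error-correcting code has to resolve.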