DNA data storage is emerging as a possible solution for archiving digital data. To further increase the potential capacity of DNA-based data storage systems, the combinatorial composite DNA synthesis method was recently proposed. This approach extends the DNA alphabet by harnessing short DNA fragment reagents, known as shortmers. Each symbol of the extended alphabet is composed of a fixed number of shortmers. Thus, when information is read, one of the shortmers that make up a symbol may be missing, in which case the symbol cannot be determined. In this paper, we model this event as an asymmetric error and propose code constructions that correct such errors in this setup. We also provide a lower bound on the redundancy of such error-correcting codes and give an explicit encoder and decoder pair for our construction. Our error model is supported by an analysis of data from actual experiments that produced DNA according to the combinatorial scheme. Lastly, we provide a statistical evaluation of the probability of observing such error events as a function of read depth.
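To give a feel for how the chance of a missing shortmer might depend on read depth, the following is a minimal sketch under a simplified sampling assumption that is not taken from the paper: each read of a symbol is assumed to reveal one of its k shortmers independently and uniformly at random, with no other error sources. The function name `prob_missing_shortmer` and its parameters are hypothetical. Under that assumption, the probability that at least one shortmer is never observed follows from inclusion-exclusion.

```python
from math import comb


def prob_missing_shortmer(k: int, depth: int) -> float:
    """Probability that at least one of k shortmers is never observed.

    Assumption (illustrative only, not the paper's model): each of the
    `depth` reads independently shows one of the k shortmers of the
    symbol, chosen uniformly at random.
    """
    if k <= 0 or depth < 0:
        raise ValueError("k must be positive and depth non-negative")
    # Inclusion-exclusion over the set of shortmers that are never seen:
    # P(all k seen) = sum_{j=0}^{k} (-1)^j * C(k, j) * ((k - j) / k)^depth
    p_all_seen = sum(
        (-1) ** j * comb(k, j) * ((k - j) / k) ** depth
        for j in range(k + 1)
    )
    return 1.0 - p_all_seen


if __name__ == "__main__":
    # Hypothetical example: symbols built from k = 5 shortmers.
    for depth in (5, 10, 20, 50):
        print(depth, prob_missing_shortmer(5, depth))
```

In this toy model the probability decreases as read depth grows, which matches the intuition that deeper sequencing makes a missing shortmer less likely. Real experimental data may deviate from uniform sampling, so the numbers are only indicative.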