This paper introduces a new family of reconstruction codes motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. Previous works considered two extreme cases for the single-string setting: either all substrings of pre-defined lengths are read, or substrings are read with no overlap. This work studies two extensions of this paradigm. The first extension considers the setup in which consecutive substrings are read with some given minimum overlap. For this setup, an upper bound is first provided on the attainable rates of codes that guarantee unique reconstruction, and efficient constructions of codes that asymptotically meet this upper bound are then presented. The second extension considers the setup in which multiple strings are reconstructed together. Given the number of strings and their length, we first derive a lower bound on the length $\ell$ of the read substrings that is necessary for the existence of multi-strand reconstruction codes with non-vanishing rates. We then present two constructions of such codes and show that their rates approach 1 for values of $\ell$ that asymptotically behave like the lower bound.
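The read models mentioned above can be illustrated with a minimal sketch. It is an illustration only, not the paper's formal channel model: the helper names `read_substrings` and `min_overlap`, the example strand, and the chosen start positions are all assumptions made for this example. It shows the two extreme cases (all length-$\ell$ substrings, and non-overlapping substrings) alongside an intermediate case in which consecutive reads share a minimum overlap.

```python
def read_substrings(strand, ell, starts):
    """Return the length-ell substrings of `strand` that begin at each index in `starts`."""
    return [strand[i:i + ell] for i in starts]


def min_overlap(ell, starts):
    """Smallest overlap (in symbols) between consecutive length-ell reads.

    Two reads starting at a < b overlap in ell - (b - a) symbols,
    or 0 symbols if they do not overlap at all.
    """
    return min(max(0, ell - (b - a)) for a, b in zip(starts, starts[1:]))


# Hypothetical example strand and read length (not taken from the paper).
strand = "ACGTTGCA"
ell = 4

# Extreme case 1: every substring of length ell is read.
all_starts = list(range(len(strand) - ell + 1))
all_reads = read_substrings(strand, ell, all_starts)

# Extreme case 2: substrings are read with no overlap.
disjoint_starts = list(range(0, len(strand), ell))
disjoint_reads = read_substrings(strand, ell, disjoint_starts)

# Intermediate case: consecutive reads overlap in at least 2 symbols.
partial_starts = [0, 2, 4]
partial_reads = read_substrings(strand, ell, partial_starts)

print(all_reads, min_overlap(ell, all_starts))
print(disjoint_reads, min_overlap(ell, disjoint_starts))
print(partial_reads, min_overlap(ell, partial_starts))
```

In this sketch the overlap between consecutive reads is controlled entirely by the gap between their start positions, so the two extreme cases correspond to gaps of 1 and $\ell$, respectively.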