The problem of designing codes for deletion-correction and synchronization has received renewed interest due to applications in DNA-based data storage systems that use nanopore sequencers as readout platforms. In almost all instances, deletions are assumed to be imposed independently of each other and of the sequence context. These assumptions are not valid in practice, since nanopore errors tend to occur within specific contexts. We study contextual nanopore deletion errors through the example setting of deterministic single deletions following (complete) runs of length at least $k$. The model critically depends on the runlength threshold $k$, and we examine two regimes for $k$: a) $k=C\log n$ for a constant $C\in(0,1)$; in this case, we study error-correcting codes that can protect against a constant number $t$ of contextual deletions, and show that the minimum redundancy (ignoring lower-order terms) is between $(1-C)t\log n$ and $2(1-C)t\log n$, meaning that it is a $(1-C)$-fraction of that of arbitrary $t$-deletion-correcting codes. To complement our non-constructive redundancy upper bound, we design efficiently encodable and decodable codes for any constant $t$. In particular, for $t=1$ and $C>1/2$ we construct efficient codes with redundancy that essentially matches our non-constructive upper bound; b) $k$ equal to a constant; in this case we consider the extremal problem where the number of deletions is not bounded and a deletion is imposed after every run of length at least $k$, which we call the extremal contextual deletion channel. This combinatorial setting arises naturally by considering a probabilistic channel that introduces a contextual deletion after each run of length at least $k$ with probability $p$ and taking the limit $p\to 1$. We obtain sharp bounds on the maximum achievable rate under the extremal contextual deletion channel for arbitrary constant $k$.
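As an illustration of regime b), the extremal contextual deletion channel can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the formal channel definition: it assumes that runs are measured in the input sequence, that "the symbol after a run" is the first symbol of the following run, and that the final run triggers no deletion because nothing follows it. The function name `extremal_contextual_deletion` is hypothetical.

```python
from itertools import groupby


def extremal_contextual_deletion(x: str, k: int) -> str:
    """Delete the symbol that follows every complete run of length >= k.

    Assumptions (not taken from the source): runs are computed on the
    input x, and the deleted symbol is the first symbol of the next run.
    """
    # Split the input into maximal runs, e.g. "0001" -> [("0", 3), ("1", 1)].
    runs = [(sym, len(list(group))) for sym, group in groupby(x)]

    out = []
    delete_next = False  # True if the previous run had length >= k
    for sym, length in runs:
        # Drop one symbol from this run if the previous run was long enough.
        kept = length - 1 if delete_next else length
        out.append(sym * kept)
        # The trigger uses the original run length in the input.
        delete_next = length >= k
    return "".join(out)


if __name__ == "__main__":
    print(extremal_contextual_deletion("0001011", 3))  # 000011
    print(extremal_contextual_deletion("00110", 2))    # 001
```

In the first call, the run `000` reaches the threshold $k=3$, so the single `1` that follows it is deleted and the two runs of zeros merge in the output. Setting $k$ larger than every run length leaves the input unchanged.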