Correcting Contextual Deletions in DNA Nanopore Readouts

The problem of designing codes for deletion-correction and synchronization has received renewed interest due to applications in DNA-based data storage systems that use nanopore sequencers as readout platforms. In almost all instances, deletions are assumed to be imposed independently of each other and of the sequence context. These assumptions are not valid in practice, since nanopore errors tend to occur within specific contexts. We study contextual nanopore deletion-errors through the example setting of deterministic single deletions following (complete) runlengths of length at least $k$. The model critically depends on the runlength threshold $k$, and we examine two regimes for $k$: a) $k=C\log n$ for a constant $C\in(0,1)$; in this case, we study error-correcting codes that can protect against a constant number $t$ of contextual deletions, and show that the minimum redundancy (ignoring lower-order terms) is between $(1-C)t\log n$ and $2(1-C)t\log n$, meaning that it is a $(1-C)$-fraction of that of arbitrary $t$-deletion-correcting codes. To complement our non-constructive redundancy upper bound, we design efficiently encodable and decodable codes for any constant $t$. In particular, for $t=1$ and $C>1/2$ we construct efficient codes with redundancy that essentially matches our non-constructive upper bound; b) $k$ equal to a constant; in this case we consider the extremal problem where the number of deletions is not bounded and a deletion is imposed after every run of length at least $k$, which we call the extremal contextual deletion channel. This combinatorial setting arises naturally by considering a probabilistic channel that introduces contextual deletions after each run of length at least $k$ with probability $p$ and taking the limit $p\to 1$. We obtain sharp bounds on the maximum achievable rate under the extremal contextual deletion channel for arbitrary constant $k$.
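To make the channel model concrete, the following minimal Python sketch simulates the extremal contextual deletion channel on a short string. It rests on an assumed reading of the abstract: runs are identified in the original input, and the symbol deleted is the one immediately following each complete run of length at least $k$. The function names `eligible_positions`, `delete_at`, and `extremal_channel` are hypothetical and chosen only for illustration; the paper's formal definition may differ in details such as how a run at the end of the sequence is treated.

```python
from itertools import groupby


def eligible_positions(x: str, k: int) -> list[int]:
    """Indices of symbols that immediately follow a complete run of length >= k.

    Assumption: runs are taken in the original input x, and a run that ends
    the string has no following symbol, so it triggers no deletion.
    """
    positions = []
    end = 0  # index just past the run currently being processed
    for _, run in groupby(x):
        length = len(list(run))
        end += length
        if length >= k and end < len(x):
            positions.append(end)
    return positions


def delete_at(x: str, positions) -> str:
    """Return x with the symbols at the given indices removed."""
    drop = set(positions)
    return "".join(c for j, c in enumerate(x) if j not in drop)


def extremal_channel(x: str, k: int) -> str:
    """Delete the symbol after every run of length >= k (the p -> 1 limit)."""
    return delete_at(x, eligible_positions(x, k))


x = "00011011110"
print(eligible_positions(x, 3))  # [3, 10]
print(extremal_channel(x, 3))    # 000101111
```

Under the same assumption, the regime with a bounded number $t$ of contextual deletions corresponds to calling `delete_at` with any subset of at most $t$ of the indices returned by `eligible_positions`.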
