Nanopore sequencers, which outperform other sequencing technologies for DNA storage in several respects, have attracted considerable attention in recent years. Their high error rates, however, call for practical and efficient coding schemes that enable accurate recovery of the stored data. To this end, we consider a simplified model of a nanopore sequencer, inspired by Mao \emph{et al.}, that incorporates intersymbol interference and measurement noise. In essence, our channel model slides a window of length $\ell$ over an input sequence; at each time step, the window outputs the $L_1$-weight of the $\ell$ bits it encloses and then shifts by $\delta$ positions. The resulting $(\ell+1)$-ary vector, termed the \emph{read vector}, may additionally be corrupted by $t$ substitution errors. Using graph-theoretic techniques, we show that for $\delta=1$, at least $\log \log n$ bits of redundancy are required to correct a single ($t=1$) substitution. Finally, for $\ell \geq 3$, we exploit inherent characteristics of read vectors to construct an error-correcting code that is optimal up to an additive constant in this setting.
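The channel model described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`read_vector`, `substitute`, `violates_unit_step`) are hypothetical, and the sketch assumes that only windows lying fully inside the input are read out, since the boundary convention is not stated here. The last helper checks one elementary property that holds for $\delta=1$: adjacent windows share $\ell-1$ bits, so consecutive read values differ by at most 1. Whether this is among the characteristics the code construction exploits is not claimed.

```python
from typing import List, Sequence


def read_vector(x: Sequence[int], ell: int, delta: int = 1) -> List[int]:
    """Noiseless read vector of a binary sequence x.

    Assumption: only windows fully contained in x are read out.
    Each entry is the L1-weight (number of ones) of one window,
    so every entry lies in {0, ..., ell}.
    """
    if ell < 1 or delta < 1:
        raise ValueError("ell and delta must be positive")
    return [sum(x[i:i + ell]) for i in range(0, len(x) - ell + 1, delta)]


def substitute(read: Sequence[int], pos: int, value: int, ell: int) -> List[int]:
    """Return a copy of `read` with one substitution error at index `pos`."""
    if not 0 <= value <= ell:
        raise ValueError("value must lie in {0, ..., ell}")
    if value == read[pos]:
        raise ValueError("a substitution must change the symbol")
    noisy = list(read)
    noisy[pos] = value
    return noisy


def violates_unit_step(read: Sequence[int]) -> bool:
    """For delta = 1, adjacent windows share ell - 1 bits, so consecutive
    read values can differ by at most 1. Return True if this is violated."""
    return any(abs(a - b) > 1 for a, b in zip(read, read[1:]))


x = [1, 0, 1, 1, 0, 0, 1]
clean = read_vector(x, ell=3, delta=1)           # [2, 2, 2, 1, 1]
strided = read_vector(x, ell=3, delta=2)         # [2, 2, 1]
noisy = substitute(clean, pos=1, value=0, ell=3) # [2, 0, 2, 1, 1]
print(clean, strided, noisy, violates_unit_step(noisy))
```

In this example the substitution happens to create a jump of size 2 and is therefore detectable by the unit-step check alone. A substitution that changes a value by only 1 may pass the check, which is one reason a dedicated error-correcting code is needed.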