Despite their significant advantages over competing technologies, nanopore sequencers are plagued by high error rates, due to physical characteristics of the nanopore and inherent noise in the biological processes. It is thus paramount not only to formulate efficient error-correcting constructions for these channels, but also to establish bounds on the minimum redundancy required by such coding schemes. In this context, we adopt a simplified model of nanopore sequencing inspired by the work of Mao \emph{et al.}, accounting for the effects of intersymbol interference and measurement noise. For an input sequence of length \(n\), the vector that is produced, designated as the \emph{read vector}, may additionally suffer at most \(t\) substitution errors. We employ the well-known graph-theoretic clique-cover technique to establish that at least \(t\log n - O(1)\) bits of redundancy are required to correct multiple (\(t \geq 2\)) substitutions. While this is surprising in comparison to the case of a single substitution, which necessitates at most \(\log \log n - O(1)\) bits of redundancy, a suitable error-correcting code that is optimal up to a constant follows immediately from the properties of read vectors.