Nanopore sequencing is a promising technology for DNA sequencing. In this paper, we investigate a specific model of the nanopore sequencer, which takes a $q$-ary sequence of length $n$ as input and outputs a vector of length $n+\ell-1$, referred to as an $\ell$-read vector, whose $i$-th entry is the multiset composed of the $\ell$ elements located between the $(i-\ell+1)$-th and $i$-th positions of the input sequence. To account for substitution errors in the output vector, we study $\ell$-read codes under the Hamming metric. An $\ell$-read $(n,d)_q$-code is a set of $q$-ary sequences of length $n$ in which the Hamming distance between the $\ell$-read vectors of any two distinct sequences is at least $d$. We first improve the result of Banerjee \emph{et al.}, who studied $\ell$-read $(n,d)_q$-codes under the constraints $\ell\geq 3$ and $d=3$. Then, we investigate bounds and constructions for $2$-read codes with minimum distance $3$, $4$, and $5$. Our results indicate that when $d \in \{3,4\}$, the optimal redundancy of $2$-read $(n,d)_q$-codes is $o(\log_q n)$, while for $d=5$ it is $\log_q n+o(\log_q n)$. Additionally, we establish an equivalence between $2$-read $(n,3)_q$-codes and classical $q$-ary single-insertion reconstruction codes using two noisy reads. We improve the lower bound on the redundancy of classical $q$-ary single-insertion reconstruction codes, as well as the upper bound on the redundancy of classical $q$-ary single-deletion reconstruction codes, when using two noisy reads. Finally, we study $\ell$-read codes under the reconstruction model.
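To make the channel model concrete, the following minimal sketch computes an $\ell$-read vector and the Hamming distance between two such vectors. It is an illustration only, not the paper's construction: the function names `read_vector` and `read_distance` are hypothetical, and the sketch assumes that positions outside $1,\dots,n$ are simply ignored, so the first and last $\ell-1$ entries hold fewer than $\ell$ elements. Each multiset is represented as a sorted tuple.

```python
def read_vector(x, ell):
    """Return the ell-read vector of the sequence x (a list of symbols).

    Entry i (1-indexed, i = 1..n+ell-1) is the multiset of symbols at
    positions i-ell+1 .. i of x. Assumption: positions outside 1..n are
    ignored, so boundary entries contain fewer than ell symbols.
    """
    n = len(x)
    reads = []
    for i in range(1, n + ell):
        lo = max(1, i - ell + 1)   # clip the window to the sequence
        hi = min(n, i)
        window = x[lo - 1:hi]      # convert to 0-indexed slice
        reads.append(tuple(sorted(window)))  # sorted tuple = multiset
    return reads


def read_distance(x, y, ell):
    """Hamming distance between the ell-read vectors of x and y."""
    rx, ry = read_vector(x, ell), read_vector(y, ell)
    if len(rx) != len(ry):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(rx, ry))


if __name__ == "__main__":
    print(read_vector([0, 1, 1], 2))
    print(read_distance([0, 0, 0], [0, 1, 0], 2))
```

Under these assumptions, a set of sequences is an $\ell$-read $(n,d)_q$-code exactly when `read_distance` is at least $d$ for every pair of distinct members, which can be checked by brute force for small $n$ and $q$.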