In shotgun sequencing, the input string (typically a long DNA sequence composed of nucleotide bases) is sequenced as multiple overlapping fragments of much shorter length, called \textit{reads}. A recent work modelled the shotgun sequencing pipeline as a communication channel for DNA data storage and identified the capacity of this channel, assuming that the reads are noiseless substrings of the original sequence. Modern shotgun sequencers, however, also output a quality score for each base read, indicating the confidence in its identification, and bases with low quality scores can be treated as erased. Motivated by this, we consider the \textit{shotgun sequencing channel with erasures}, where each symbol in any read is independently erased with probability $\delta$. We identify achievable rates for this channel using a random code construction and a decoder that uses typicality-like arguments to merge the reads.
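To make the channel model concrete, the following is a minimal simulation sketch, not the paper's construction. It assumes details the abstract does not specify: reads have a fixed length, start positions are drawn uniformly at random without wrap-around, and an erased symbol is written as `?`. The function name `shotgun_reads_with_erasures` and all of its parameters are hypothetical.

```python
import random

ERASURE = "?"  # hypothetical marker for an erased symbol


def shotgun_reads_with_erasures(sequence, num_reads, read_len, delta, rng=None):
    """Sample reads from `sequence` and erase each symbol independently.

    Assumptions (not taken from the source): fixed read length, uniformly
    random start positions, no wrap-around at the end of the sequence.
    """
    if not 0 < read_len <= len(sequence):
        raise ValueError("read_len must be in 1..len(sequence)")
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must be a probability")
    rng = rng or random.Random()

    reads = []
    for _ in range(num_reads):
        # Pick where the noiseless fragment starts in the input string.
        start = rng.randrange(len(sequence) - read_len + 1)
        fragment = sequence[start:start + read_len]
        # Each symbol is erased independently with probability delta.
        noisy = "".join(
            ERASURE if rng.random() < delta else base for base in fragment
        )
        reads.append(noisy)
    # The start positions are deliberately not returned: the decoder only
    # sees the unordered reads and has to merge them itself.
    return reads


if __name__ == "__main__":
    demo = shotgun_reads_with_erasures(
        "ACGTTGCAAGCTTAGC", num_reads=4, read_len=6, delta=0.2,
        rng=random.Random(7),
    )
    for read in demo:
        print(read)
```

Setting `delta = 0` in this sketch gives noiseless substrings, which corresponds to the earlier model referred to above, while `delta = 1` erases every symbol and leaves the decoder with no information.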