The problem of reconstructing a sequence from the set of its length-$k$ substrings has received considerable attention due to its various applications in genomics. We study an uncoded version of this problem where multiple random sources are to be simultaneously reconstructed from the union of their $k$-mer sets. We consider an asymptotic regime where $m = n^\alpha$ i.i.d. source sequences of length $n$ are to be reconstructed from the set of their substrings of length $k=\beta \log n$, and seek to characterize the $(\alpha,\beta)$ pairs for which reconstruction is information-theoretically feasible. We show that, as $n \to \infty$, the source sequences can be reconstructed if $\beta > \max(2\alpha+1,\alpha+2)$ and cannot be reconstructed if $\beta < \max( 2\alpha+1, \alpha+ \tfrac32)$, characterizing the feasibility region almost completely. Interestingly, our result shows that there are feasible $(\alpha,\beta)$ pairs where repeats across the source strings abound, and non-trivial reconstruction algorithms are needed to achieve the fundamental limit.
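The observation model and the two thresholds stated in the abstract can be illustrated with a minimal Python sketch. The function names `kmer_union` and `classify` are hypothetical and chosen for illustration only. The sketch does not implement any reconstruction algorithm; it only builds the union of $k$-mer sets and checks where a given $(\alpha,\beta)$ pair falls relative to the stated bounds. The abstract says nothing about pairs lying exactly on a threshold, so the sketch labels them "open" as an assumption.

```python
def kmer_union(sources, k):
    """Return the union of the length-k substring sets of all source strings."""
    return {s[i:i + k] for s in sources for i in range(len(s) - k + 1)}


def classify(alpha, beta):
    """Place (alpha, beta) relative to the two bounds stated in the abstract.

    Asymptotic statement only (n -> infinity); boundary points are treated
    as "open" here, which is an assumption of this sketch.
    """
    achievable = max(2 * alpha + 1, alpha + 2)    # reconstruction possible above this
    converse = max(2 * alpha + 1, alpha + 1.5)    # reconstruction impossible below this
    if beta > achievable:
        return "feasible"
    if beta < converse:
        return "infeasible"
    return "open"


if __name__ == "__main__":
    # Two toy sources sharing the 3-mer "CGT": the union hides which source it came from.
    print(sorted(kmer_union(["ACGT", "CGTA"], 3)))
    for pair in [(0.5, 3.0), (0.5, 1.9), (0.25, 1.9), (2.0, 5.5)]:
        print(pair, classify(*pair))
```

For $\alpha \ge 1$ both thresholds equal $2\alpha+1$, so the characterization is tight there. A gap between the two bounds remains only for $\alpha < 1$.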