In computational genomics, many analyses rely on efficient storage and traversal of $k$-mers. This motivates compact representations such as spectrum-preserving string sets (SPSS), which store a set of strings whose $k$-mer spectrum equals that of the input. Existing approaches, including Unitigs, Eulertigs and Matchtigs, model this task as a path cover problem on the de Bruijn graph. We extend this framework from paths to branching structures by introducing necklace covers, which combine cycles with tree-like attachments (pendants). We present a greedy algorithm that constructs a necklace cover and, under certain conditions, guarantees that the cumulative size of the resulting representation is optimal. Experiments on real genomic datasets indicate that the minimum necklace cover yields smaller representations than Eulertigs and compression comparable to the Masked Superstrings approach, while preserving the $k$-mer spectrum exactly.
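To make the spectrum-preserving property concrete, the following minimal Python sketch computes a $k$-mer spectrum and checks whether a candidate string set preserves it. The function names and the toy sequences are illustrative assumptions, not taken from the paper, and the sketch ignores reverse complements, which genomic tools often treat as equivalent.

```python
def kmer_spectrum(strings, k):
    """Return the set of distinct k-mers occurring in any of the strings."""
    # Strings shorter than k contribute nothing (the range is empty).
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def is_spectrum_preserving(candidate, reference, k):
    """True if both string sets contain exactly the same k-mers."""
    return kmer_spectrum(candidate, k) == kmer_spectrum(reference, k)


def cumulative_length(strings):
    """Total number of characters stored by a representation."""
    return sum(len(s) for s in strings)


# Toy input (hypothetical): two overlapping sequences and a single
# string that covers the same 3-mers with fewer characters.
reads = ["ACGTA", "CGTAC"]
compact = ["ACGTAC"]

same_spectrum = is_spectrum_preserving(compact, reads, k=3)
saved = cumulative_length(reads) - cumulative_length(compact)
```

In this toy case the single string stores the same four 3-mers as the two input sequences while using fewer characters, which is the quantity (cumulative size) that the representations discussed above aim to minimize.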