Most sequence sketching methods work by selecting specific $k$-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Estimating sequence similarity from sketches is much faster than computing a sequence alignment, so sketching methods are used to reduce the resource requirements of computational biology software. Applications that use sketches often rely on properties of the $k$-mer selection procedure to ensure that working with a sketch does not degrade the quality of the results compared with sequence alignment. In particular, the window guarantee ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee corresponds to a decycling set, also known as an unavoidable set of $k$-mers: any sufficiently long sequence must contain a $k$-mer from the set (hence the set is unavoidable). Conversely, a decycling set defines a sketching method by selecting the $k$-mers that belong to the set. Although current methods come from a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics is a promising approach to discovering new sketching methods with improved performance (e.g., a smaller window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their small size. Only two algorithms, by Mykkeltveit and by Champarnaud, are known, and each generates one particular MDS, although a vast number of alternative MDSs exist. We provide a simple method that allows one to explore the space of MDSs and to find sets optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal with respect to one particular property, the remaining path length.
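The decycling property and the remaining path length described in the abstract can be checked by brute force on small instances. The following is a minimal sketch, not the authors' method: it assumes the standard de Bruijn graph view (nodes are $k$-mers, with an edge from `u` to `v` when `u[1:] == v[:-1]`), and the function name `remaining_path_length` is hypothetical. It also assumes that path length is counted in $k$-mers (nodes) rather than edges; the paper may use a different convention.

```python
from itertools import product


def remaining_path_length(k, kmer_set, alphabet="ACGT"):
    """Longest path, counted in k-mers, that avoids every k-mer in kmer_set.

    Returns None if a cycle survives the removal of kmer_set, i.e. if
    kmer_set is NOT a decycling set. Brute force over all |alphabet|**k
    k-mers, so only suitable for small k.
    """
    banned = set(kmer_set)
    longest = {}      # finished k-mer -> k-mers on the longest path starting there
    on_stack = set()  # k-mers on the current depth-first search path

    for chars in product(alphabet, repeat=k):
        root = "".join(chars)
        if root in banned or root in longest:
            continue
        # Iterative depth-first search: (k-mer, index of next successor to try).
        stack = [(root, 0)]
        on_stack.add(root)
        while stack:
            node, i = stack[-1]
            if i < len(alphabet):
                stack[-1] = (node, i + 1)
                nxt = node[1:] + alphabet[i]
                if nxt in banned or nxt in longest:
                    continue
                if nxt in on_stack:
                    return None  # back edge: a cycle avoids kmer_set
                on_stack.add(nxt)
                stack.append((nxt, 0))
            else:
                # All surviving successors are finished; record the best extension.
                best = 0
                for c in alphabet:
                    nxt = node[1:] + c
                    if nxt not in banned:
                        best = max(best, longest[nxt])
                longest[node] = 1 + best
                on_stack.discard(node)
                stack.pop()

    return max(longest.values(), default=0)


# Binary alphabet, k = 3: one k-mer removed from each of the four necklace classes.
print(remaining_path_length(3, {"AAA", "BBB", "ABA", "ABB"}, alphabet="AB"))
# Removing only the two self-loops leaves cycles such as AAB -> ABA -> BAA -> AAB.
print(remaining_path_length(3, {"AAA", "BBB"}, alphabet="AB"))
```

Under these assumptions, a path of $p$ $k$-mers spells a sequence of $p + k - 1$ characters, so a return value of $p$ means that every sequence longer than $p + k - 1$ characters contains at least one $k$-mer from the set. A return value of `None` means the set can be avoided by arbitrarily long sequences and therefore gives no window guarantee.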