In recent years, there has been renewed interest in the search for low-density minimizer schemes. These schemes take a window of $w$ consecutive $k$-mers and sample one of them: the smallest under some specific order. Schemes such as the mod-minimizer achieve a low density (fraction of sampled $k$-mers) when $k \gg w$, while schemes such as the greedy minimizer work well for explicit small parameters, roughly in the regime $k \leq 2w$, for $k$ and $w$ up to $15$ or so. When $k$ is very small ($k < \log_\sigma w$), minimizer schemes cannot do well, and more general sampling schemes are needed that can use richer information than a comparison of $k$-mers. Bidirectional-string anchors (bd-anchors) form one such scheme. Inspired by bd-anchors, we introduce the smallest unique substring anchor, or SUS-anchor: given a window, it considers all suffixes that do not occur as a substring elsewhere in the window. It then samples the start position of the smallest such suffix according to the new anti-lexicographic order, which minimizes the first character and maximizes the remaining characters. We give a linear-time, $O(w)$-space streaming algorithm to compute all SUS-anchors of a string. For alphabet size $\sigma=4$ and $k=1$, the anti-lexicographic SUS-anchor empirically has a density within $1\%$ of the density lower bound, significantly improving over bd-anchors, which are often $>15\%$ above it. For alphabet size $\sigma=2$, the density is at most $10\%$ above the lower bound, which again improves over the $>50\%$ overhead of bd-anchors.
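To make the definition concrete, the following is a minimal brute-force sketch of SUS-anchor selection as described above. It is not the linear-time streaming algorithm; it takes quadratic time or worse per window. The function names (`unique_suffix_starts`, `anti_lex_key`, `sus_anchor`, `sampled_positions`) are hypothetical, and the sketch assumes $k=1$, so that a window is simply `window_len` consecutive characters, with characters compared by code point.

```python
from __future__ import annotations


def unique_suffix_starts(window: str) -> list[int]:
    """Start positions i whose suffix window[i:] occurs nowhere else in the window.

    A suffix starting at i can only reoccur at an earlier position, so it is
    unique exactly when its first occurrence is at i itself.
    """
    return [i for i in range(len(window)) if window.find(window[i:]) == i]


def anti_lex_key(s: str) -> tuple[int, ...]:
    """Sort key for the anti-lexicographic order (assumed encoding).

    The first character is minimized and every later character is maximized,
    which is modelled here by negating the code points after the first.
    Unique suffixes are never prefixes of one another, so the comparison is
    always decided at a differing character.
    """
    return (ord(s[0]),) + tuple(-ord(c) for c in s[1:])


def sus_anchor(window: str) -> int:
    """Offset within the window of the smallest unique suffix."""
    starts = unique_suffix_starts(window)  # never empty: offset 0 is always unique
    return min(starts, key=lambda i: anti_lex_key(window[i:]))


def sampled_positions(text: str, window_len: int) -> list[int]:
    """Distinct absolute positions sampled over all windows of the text."""
    return sorted(
        {j + sus_anchor(text[j:j + window_len])
         for j in range(len(text) - window_len + 1)}
    )


if __name__ == "__main__":
    # In "ACAB" the unique suffixes are "ACAB", "CAB", "AB" and "B".
    # Plain lexicographic order would pick "AB"; the anti-lexicographic order
    # prefers "ACAB" because its second character 'C' is larger than 'B'.
    print(sus_anchor("ACAB"))
    print(sampled_positions("ACABA", 3))
```

In the window `ABA`, the last suffix `A` also occurs at offset 0, so it is excluded from the candidates, and the anchor falls on the full window at offset 0.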