Space-Efficient $k$-Mismatch Text Indexes

A central task in string processing is text indexing, where the goal is to preprocess a text (a string of length $n$) into an efficient index (a data structure) supporting queries about the text. Cole, Gottlieb, and Lewenstein (STOC 2004) proposed $k$-errata trees, a family of text indexes supporting approximate pattern matching queries of several types. In particular, $k$-errata trees yield an elegant solution to $k$-mismatch queries, where we are to report all substrings of the text with Hamming distance at most $k$ to the query pattern. The resulting $k$-mismatch index uses $O(n\log^k n)$ space and answers a query for a length-$m$ pattern in $O(\log^k n \log \log n + m + occ)$ time, where $occ$ is the number of approximate occurrences. In retrospect, $k$-errata trees appear very well optimized: even though a large body of work has adapted $k$-errata trees to various settings throughout the past two decades, the original time-space trade-off for $k$-mismatch indexing has not been improved in the general case. We present the first such improvement, a $k$-mismatch index with $O(n\log^{k-1} n)$ space and the same query time as $k$-errata trees. Previously, due to a result of Chan, Lam, Sung, Tam, and Wong (Algorithmica 2010), such an $O(n\log^{k-1} n)$-size index has been known only for texts over alphabets of constant size. In this setting, however, we obtain an even smaller $k$-mismatch index of size only $O(n \log^{k-2+\varepsilon+\frac{2}{k+2-(k \bmod 2)}} n)\subseteq O(n\log^{k-1.5+\varepsilon} n)$ for $2\le k\le O(1)$ and any constant $\varepsilon>0$. Along the way, we also develop improved indexes for short patterns, offering better trade-offs in this practically relevant special case.
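To make the query semantics concrete, the following is a minimal index-free baseline for a $k$-mismatch query: it scans every length-$m$ window of the text and reports the starting positions whose Hamming distance to the pattern is at most $k$. This is not the $k$-errata tree or the index proposed here. It takes $O(nm)$ time per query and only illustrates what such an index must output. The function names are hypothetical and chosen for illustration.

```python
from typing import List


def hamming_distance_capped(a: str, b: str, k: int) -> int:
    """Count mismatching positions of two equal-length strings.

    Counting stops as soon as the count exceeds k, so the result is
    either the exact Hamming distance (if <= k) or k + 1.
    """
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                break
    return mismatches


def k_mismatch_occurrences(text: str, pattern: str, k: int) -> List[int]:
    """Return all 0-based starting positions i such that text[i:i+m]
    has Hamming distance at most k to the pattern (naive scan)."""
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if hamming_distance_capped(text[i:i + m], pattern, k) <= k
    ]


if __name__ == "__main__":
    # "ban" differs from "nan" in one position; "nan" matches exactly.
    print(k_mismatch_occurrences("banana", "nan", 1))
```

The number of reported positions corresponds to $occ$ in the query time bound. An index aims to avoid the scan over all $n - m + 1$ windows.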
