A central task in string processing is text indexing, where the goal is to preprocess a text (a string of length $n$) into an efficient index (a data structure) supporting queries about the text. Cole, Gottlieb, and Lewenstein (STOC 2004) proposed $k$-errata trees, a family of text indexes supporting approximate pattern matching queries of several types. In particular, $k$-errata trees yield an elegant solution to $k$-mismatch queries, where we are to report all substrings of the text with Hamming distance at most $k$ to the query pattern. The resulting $k$-mismatch index uses $O(n\log^k n)$ space and answers a query for a length-$m$ pattern in $O(\log^k n \log \log n + m + occ)$ time, where $occ$ is the number of approximate occurrences. In retrospect, $k$-errata trees appear very well optimized: even though a large body of work has adapted $k$-errata trees to various settings throughout the past two decades, the original time-space trade-off for $k$-mismatch indexing has not been improved in the general case. We present the first such improvement, a $k$-mismatch index with $O(n\log^{k-1} n)$ space and the same query time as $k$-errata trees. Previously, due to a result of Chan, Lam, Sung, Tam, and Wong (Algorithmica 2010), such an $O(n\log^{k-1} n)$-size index has been known only for texts over alphabets of constant size. In this setting, however, we obtain an even smaller $k$-mismatch index of size only $O(n \log^{k-2+\varepsilon+\frac{2}{k+2-(k \bmod 2)}} n)\subseteq O(n\log^{k-1.5+\varepsilon} n)$ for $2\le k\le O(1)$ and any constant $\varepsilon>0$. Along the way, we also develop improved indexes for short patterns, offering better trade-offs in this practically relevant special case.
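To make the query semantics concrete, here is a minimal sketch of what a $k$-mismatch query returns, computed by brute force with no index at all. It is only a reference for checking answers, not the $k$-errata tree or the index proposed in this work, and the function names (`hamming_distance`, `k_mismatch_occurrences`, `small_alphabet_exponent_offset`) are illustrative assumptions. The last helper evaluates the term $\frac{2}{k+2-(k \bmod 2)}$ from the constant-alphabet space bound. This term is at most $\frac{1}{2}$ for every $k\ge 2$, which is why the bound is contained in $O(n\log^{k-1.5+\varepsilon} n)$.

```python
from fractions import Fraction


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Brute-force reference (hypothetical helper, not the paper's index).

    Returns every start position i such that text[i:i+m] is within Hamming
    distance k of the pattern. Takes O(n * m) time and uses no preprocessing.
    """
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if hamming_distance(text[i:i + m], pattern) <= k
    ]


def small_alphabet_exponent_offset(k: int) -> Fraction:
    """The term 2 / (k + 2 - (k mod 2)) in the constant-alphabet space bound."""
    return Fraction(2, k + 2 - (k % 2))


if __name__ == "__main__":
    # Exact matching (k = 0) versus allowing one mismatch (k = 1).
    print(k_mismatch_occurrences("abcabdabe", "abc", 0))
    print(k_mismatch_occurrences("abcabdabe", "abc", 1))
    # The offset never exceeds 1/2 for k >= 2.
    for k in range(2, 8):
        print(k, small_alphabet_exponent_offset(k))
```

A real index answers the same query in $O(\log^k n \log \log n + m + occ)$ time after preprocessing. The brute-force version is useful mainly as a correctness oracle when testing such an index on small inputs.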