In recent decades, the need to process massive amounts of textual data has fueled the development of compressed text indexes: data structures that efficiently answer queries on a given text while occupying space proportional to the compressed representation of the text. A widespread phenomenon in compressed indexing is that more powerful queries require larger indexes. For example, random access, the most basic query, can be supported in $O(\delta\log\frac{n\log\sigma}{\delta\log n})$ space (where $n$ is the text length, $\sigma$ is the alphabet size, and $\delta$ is the text's substring complexity), which is the asymptotically smallest space to represent a string, for all $n$, $\sigma$, and $\delta$ (Kociumaka, Navarro, Prezza; IEEE Trans. Inf. Theory 2023). The other end of the hierarchy is occupied by indexes supporting the powerful suffix array (SA) queries. The smallest such index currently known takes $O(r\log\frac{n}{r})$ space, where $r\geq\delta$ is the number of runs in the BWT of the text (Gagie, Navarro, Prezza; J. ACM 2020). We present a new compressed index that needs only $O(\delta\log\frac{n\log\sigma}{\delta\log n})$ space to support SA functionality in $O(\log^{4+\epsilon} n)$ time. This collapses the hierarchy of compressed data structures into a single point: the space required to represent the text is simultaneously sufficient for efficient SA queries. Our result immediately improves the space complexity of dozens of algorithms, which can now be executed in optimal compressed space. In addition, we show how to construct our index in $O(\delta\,\mathrm{polylog}\,n)$ time from the LZ77 parsing of the text. For highly repetitive texts, this is up to exponentially faster than the previous best algorithm. To obtain our results, we develop numerous techniques of independent interest, including the first $O(\delta\log\frac{n\log\sigma}{\delta\log n})$-size index for LCE queries.
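To make the quantities in the abstract concrete, the following minimal Python sketch computes them by brute force on a tiny text: the suffix array, the number $r$ of BWT runs, the substring complexity $\delta$ (taken here as the maximum over $k$ of the number of distinct length-$k$ substrings divided by $k$), and an LCE (longest common extension) query. This is not the index described above and uses none of its techniques; it runs in polynomial time and uncompressed space, and serves only to illustrate what the queries return. All function names are hypothetical, and appending a sentinel character before computing the BWT is an assumed convention.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_runs(text):
    """Number r of equal-letter runs in the BWT of text.

    Assumption: a sentinel smaller than every other character is appended,
    which is one common convention for defining the BWT.
    """
    t = text + "\0"
    sa = suffix_array(t)
    # BWT[j] is the character preceding the j-th smallest suffix
    # (t[-1] is the sentinel, which precedes the suffix starting at 0).
    bwt = [t[i - 1] for i in sa]
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)


def substring_complexity(text):
    """delta = max over k of (number of distinct substrings of length k) / k."""
    n = len(text)
    return max(
        len({text[i:i + k] for i in range(n - k + 1)}) / k
        for k in range(1, n + 1)
    )


def lce(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    length = 0
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


if __name__ == "__main__":
    text = "banana"
    print("SA    =", suffix_array(text))
    print("r     =", bwt_runs(text))
    print("delta =", substring_complexity(text))
    print("LCE(1, 3) =", lce(text, 1, 3))
```

On a text this short the measures are all close to the text length; the gap between $n$, $r$, and $\delta$ only becomes significant on long, highly repetitive inputs.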