Dynamic Suffix Array in Optimal Compressed Space

Textual data, such as strings and texts, makes up a considerable portion of today's rapidly growing large datasets. Simple compression methods and standard data structures are inadequate for processing such data: the former require decompression before use, and the latter consume large amounts of memory. This has motivated the development of compressed data structures that support various queries on a given string, typically in polylogarithmic time and using compressed space proportional to the string's length. In particular, the suffix array (SA) query is a critical component in implementing a suffix tree, which has a broad spectrum of applications. A line of research has studied compressed data structures, mostly static ones, that support the SA query. A common finding of most of these studies is that existing compressed data structures are not space-optimal. Kociumaka, Navarro, and Prezza [IEEE Trans. Inf. Theory 2023] established an asymptotically minimal space requirement, $O\left(\delta \log\frac{n\log\sigma}{\delta\log n} \log n \right)$ bits ($\delta$-optimal space), sufficient to represent any string of length $n$ over an alphabet of size $\sigma$ with substring complexity $\delta$, where $\delta$ serves as a measure of repetitiveness. More recently, Kempa and Kociumaka [FOCS 2023] presented $\delta$-SA, a compressed data structure that supports SA queries in $\delta$-optimal space. However, the data structures introduced so far are static. We present the first dynamic compressed data structure that supports SA queries and updates in polylogarithmic time and $\delta$-optimal space. More precisely, it answers SA queries in $O(\log^7 n)$ time and performs updates in expected $O(\log^8 n)$ time, using expected $\delta$-optimal space.

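
To make the two central notions concrete, the following is a minimal, uncompressed sketch in Python of what an SA query returns and how the substring complexity $\delta$ can be computed by brute force. It assumes the standard definitions: SA lists the starting positions of the suffixes in lexicographic order, and $\delta$ is the maximum over $k$ of the number of distinct length-$k$ substrings divided by $k$. The function names `suffix_array` and `substring_complexity` are hypothetical and chosen for illustration; this is not the compressed dynamic structure described above, only a reference for the quantities it works with.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array: 0-based starting positions of all suffixes,
    listed in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def substring_complexity(text: str) -> float:
    """Brute-force substring complexity: the maximum over k of
    (number of distinct substrings of length k) / k."""
    n = len(text)
    best = 0.0
    for k in range(1, n + 1):
        distinct = len({text[i:i + k] for i in range(n - k + 1)})
        best = max(best, distinct / k)
    return best


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)                           # SA query i simply reads sa[i]
    print(substring_complexity(text))

    # A naive "update": insert a character and rebuild everything from scratch.
    # A dynamic structure aims to avoid exactly this full recomputation.
    updated = text[:3] + "x" + text[3:]
    print(suffix_array(updated))
```

Rebuilding after every edit, as the last lines do, costs time at least linear in the text length per update and stores the text uncompressed; the polylogarithmic bounds stated in the abstract are what a dynamic compressed structure offers instead.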