R-enum Revisited: Speedup and Extension for Context-Sensitive Repeats and Net Frequencies

Nishimoto and Tabei [CPM, 2021] proposed r-enum, an algorithm to enumerate various characteristic substrings, including maximal repeats, in a string $T$ of length $n$ in $O(r)$ words of compressed working space, where $r \le n$ is the number of runs in the Burrows-Wheeler transform (BWT) of $T$. Given the run-length encoded BWT (RLBWT) of $T$, r-enum runs in $O(n \log \log_{w} (n/r))$ time in addition to time linear in the number of output strings, where $w = \Theta(\log n)$ is the word size. In this paper, we first improve the $O(n \log \log_{w} (n/r))$ term to $O(n)$. We next extend r-enum to compute other context-sensitive repeats, such as near-supermaximal repeats (NSMRs) and supermaximal repeats, as well as the context diversity of every maximal repeat, within the same complexities. Furthermore, we study net occurrences: an occurrence of a repeat is called a net occurrence if it is not covered by another repeat, and the net frequency of a repeat is the number of its net occurrences. With this terminology, an NSMR is a repeat with a positive net frequency. Given the RLBWT of $T$, we show how to compute the set $S^{nsmr}$ of all NSMRs in $T$ together with their net frequencies/occurrences in $O(n)$ time and $O(r)$ space. We also show that an $O(r)$-space data structure can be built from the RLBWT to compute the net frequency/occurrences of any pattern in optimal time. The data structure is built in $O(r)$ space and in $O(n)$ time with high probability, or in deterministic $O(n + |S^{nsmr}| \log \log \min(\sigma, |S^{nsmr}|))$ time, where $\sigma \le r$ is the alphabet size of $T$. To achieve this, we prove that the total number of net occurrences is less than $2r$. With the duality between net occurrences and \emph{minimal unique substrings (MUSs)}, we get a new upper bound of $2r$ on the number of MUSs in $T$, which may be of independent interest.
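
To make the definition of net occurrences concrete, the following is a minimal brute-force sketch. It is not r-enum, does not use the RLBWT, and runs in polynomial time and uncompressed space, so it is meant only for toy inputs. It relies on the observation that an occurrence $T[i..j]$ of a repeat is covered by another repeat exactly when its one-character left extension or its one-character right extension is itself a repeat. The function names `count_occurrences` and `net_occurrences` are hypothetical, positions are 0-based, and treating an extension that would run past either end of $T$ as non-covering (that is, using no sentinel characters) is an assumption of this sketch.

```python
from collections import defaultdict


def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def net_occurrences(text: str) -> dict[str, list[int]]:
    """Map each repeat with positive net frequency to the sorted
    start positions of its net occurrences (brute force)."""
    n = len(text)
    result: dict[str, list[int]] = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = text[i:j]
            if count_occurrences(text, s) < 2:
                # Any longer substring starting at i is unique as well.
                break
            # Assumption: an extension past either end of the text
            # does not exist, so it cannot cover the occurrence.
            left_covered = i > 0 and count_occurrences(text, text[i - 1:j]) >= 2
            right_covered = j < n and count_occurrences(text, text[i:j + 1]) >= 2
            if not left_covered and not right_covered:
                result[s].append(i)
    return dict(result)


if __name__ == "__main__":
    for sample in ("abcabc", "aaaa", "abcd"):
        found = net_occurrences(sample)
        net_freq = {s: len(pos) for s, pos in found.items()}
        print(sample, found, net_freq)
```

Under these assumptions, the keys of the returned dictionary are the NSMRs of the input, and the length of each position list is the corresponding net frequency. For `"abcabc"`, only `abc` has net occurrences (at positions 0 and 3); every occurrence of `a`, `b`, `c`, `ab`, and `bc` is covered by an occurrence of a longer repeat.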
