An occurrence of a repeated substring $u$ in a string $S$ is called a net occurrence if extending the occurrence to the left or to the right decreases the number of occurrences to 1. The net frequency (NF) of a repeated substring $u$ in a string $S$ is the number of net occurrences of $u$ in $S$. Very recently, Guo et al. [SPIRE 2024] proposed an online $O(n \log σ)$-time algorithm that maintains a data structure of $O(n)$ space, answers Single-NF queries in $O(m \log σ + σ^2)$ time, and reports all answers to the All-NF problem in $O(nσ^2)$ time. Here, $n$ is the length of the input string $S$, $m$ is the length of the query pattern, and $σ$ is the alphabet size. The $σ^2$ term is a major drawback of their method, since computing string net frequencies was originally motivated by Chinese language processing, where $σ$ can be in the thousands. This paper presents an improved online $O(n \log σ)$-time algorithm, which answers Single-NF queries in $O(m \log σ)$ time and reports all answers to the All-NF problem in output-optimal $O(|\mathsf{NF}^+(S)|)$ time, where $\mathsf{NF}^+(S)$ is the set of substrings of $S$ paired with their positive NF values. We note that $|\mathsf{NF}^+(S)| = O(n)$ always holds. In contrast to Guo et al.'s algorithm, which is based on Ukkonen's suffix tree construction, our algorithm is based on Weiner's suffix tree construction.
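To make the definitions concrete, the following is a minimal brute-force sketch of Single-NF and All-NF. It is not the suffix-tree algorithm of this paper or of Guo et al.; it only illustrates what the quantities mean and runs in polynomial rather than near-linear time. Two points are assumptions of the sketch rather than statements from the abstract: an occurrence is counted as net only when both the one-character left extension and the one-character right extension occur exactly once in $S$, and an extension past either end of $S$ is treated as unique (as if $S$ were surrounded by distinct sentinel symbols). The function names `count_occurrences`, `net_frequency`, and `all_positive_nf` are hypothetical.

```python
def count_occurrences(s: str, u: str) -> int:
    """Count occurrences of u in s, allowing overlaps."""
    count = 0
    start = s.find(u)
    while start != -1:
        count += 1
        start = s.find(u, start + 1)
    return count


def net_frequency(s: str, u: str) -> int:
    """Brute-force Single-NF: number of net occurrences of u in s.

    Assumption: an extension beyond the boundary of s counts as unique.
    """
    if not u or count_occurrences(s, u) < 2:
        return 0  # only repeated substrings can have net occurrences
    n, m = len(s), len(u)
    nf = 0
    i = s.find(u)
    while i != -1:
        j = i + m  # exclusive end of this occurrence
        left_unique = i == 0 or count_occurrences(s, s[i - 1:j]) == 1
        right_unique = j == n or count_occurrences(s, s[i:j + 1]) == 1
        if left_unique and right_unique:
            nf += 1
        i = s.find(u, i + 1)
    return nf


def all_positive_nf(s: str) -> dict[str, int]:
    """Brute-force All-NF: every substring of s with positive NF."""
    substrings = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    result = {}
    for u in substrings:
        nf = net_frequency(s, u)
        if nf > 0:
            result[u] = nf
    return result
```

For example, in `"abcabd"` the substring `ab` occurs twice, and each occurrence becomes unique when extended in either direction, so `net_frequency("abcabd", "ab")` is 2. The substring `a` is also repeated, but both of its occurrences extend to the right into `ab`, which is still repeated, so its net frequency is 0.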