Counting Distinct (Non-)Crossing Substrings in Optimal Time

Let $w$ be a string of length $n$. The problem of counting factors crossing a position -- Problem 64 from the textbook ``125 Problems in Text Algorithms'' [Crochemore, Lecroq, and Rytter, 2021] -- asks for the number $\mathcal{C}(w,k)$ (resp. $\mathcal{N}(w,k)$) of distinct substrings of $w$ that have an occurrence containing (resp. not containing) a position $k$ in $w$. The solutions given in the textbook compute $\mathcal{C}(w,k)$ and $\mathcal{N}(w,k)$ in $O(n)$ time for a single position $k$, so applying them directly to all positions $k = 1, \ldots, n$ would require $O(n^2)$ time. Moreover, those solutions are designed for constant-size alphabets. In this paper, we present new algorithms that, for all positions $k = 1, \ldots, n$ in $w$, compute $\mathcal{C}(w,k)$ in $O(n)$ total time for general ordered alphabets and $\mathcal{N}(w,k)$ in $O(n)$ total time for linearly sortable alphabets. We further derive model-dependent optimal bounds by separating each algorithm into a preprocessing phase and a linear-time postprocessing phase: for $\mathcal{C}$ the preprocessing is run reporting, and for $\mathcal{N}$ it is the computation of longest previous non-overlapping factors (LPnF) and longest next factors (LNF). In particular, all values $\mathcal{C}(w,k)$ can be computed in $O(n \log n)$ time over general unordered alphabets, in which access to characters is restricted to equality tests, and in $O(n \log \sigma)$ time in the word RAM model, where $\sigma$ denotes the number of distinct characters occurring in $w$. For $\mathcal{N}(w,k)$, the equality-testing complexity over general unordered alphabets is $\Theta(n^2)$. We also show that our upper bounds are optimal for all of the aforementioned alphabet assumptions and computation models.
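
To make the two quantities concrete, the following is a minimal brute-force sketch of the definitions of $\mathcal{C}(w,k)$ and $\mathcal{N}(w,k)$. It is not one of the algorithms of this paper: it enumerates every occurrence explicitly and is far slower than the bounds stated above. The function name `count_crossing_bruteforce` is hypothetical, positions are assumed to be 1-indexed, and only non-empty substrings are counted (also an assumption).

```python
def count_crossing_bruteforce(w: str, k: int) -> tuple[int, int]:
    """Return (C(w, k), N(w, k)) by explicit enumeration; k is 1-indexed."""
    n = len(w)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(w)")

    crossing = set()      # substrings with an occurrence containing position k
    non_crossing = set()  # substrings with an occurrence avoiding position k

    for i in range(n):
        for j in range(i + 1, n + 1):
            # The occurrence w[i:j] covers the 1-indexed positions i+1 .. j.
            if i < k <= j:
                crossing.add(w[i:j])
            else:
                non_crossing.add(w[i:j])

    return len(crossing), len(non_crossing)


if __name__ == "__main__":
    # For w = "aab" and k = 2, the crossing substrings are a, aa, ab, aab,
    # and the non-crossing substrings are a and b.
    print(count_crossing_bruteforce("aab", 2))  # prints (4, 2)
```

Note that a substring may be counted by both quantities: in the example, `a` has one occurrence containing position 2 and another that avoids it, so $\mathcal{C}(w,k) + \mathcal{N}(w,k)$ can exceed the number of distinct substrings of $w$.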
