Sorted Consecutive Occurrence Queries in Substrings

The string indexing problem is a fundamental computational problem with numerous applications, including information retrieval and bioinformatics. It asks for an efficient solution to the pattern matching problem: given a text $T$ of length $n$ for preprocessing and a pattern $P$ of length $m$ as a query, the goal is to report all occurrences of $P$ as substrings of $T$. Navarro and Thankachan [CPM 2015, Theor. Comput. Sci. 2016] introduced a variant of this problem called the gap-bounded consecutive occurrence query, which reports the pairs of consecutive occurrences of $P$ in $T$ whose gaps (i.e., the distances between the two occurrences) lie within a query-specified range $[g_1, g_2]$. Recently, Bille et al. [FSTTCS 2020, Theor. Comput. Sci. 2022] proposed the top-$k$ close consecutive occurrence query, which reports the $k$ closest consecutive occurrences of $P$ in $T$, sorted in non-descending order of distance. Both problems can be solved in optimal query time with $O(n \log n)$-space data structures. In this paper, we generalize these problems to the range query model, which considers only the occurrences of $P$ in a specified substring $T[a..b]$ of $T$. Our contributions are as follows: (1) We propose an $O(n \log^2 n)$-space data structure that answers the range top-$k$ consecutive occurrence query in $O(|P| + \log\log n + k)$ time. (2) We propose an $O(n \log^{2+\epsilon} n)$-space data structure that answers the range gap-bounded consecutive occurrence query in $O(|P| + \log\log n + \mathit{output})$ time, where $\epsilon$ is a positive constant and $\mathit{output}$ denotes the number of outputs. Additionally, as by-products, we present algorithms for geometric problems involving weighted horizontal segments in a 2D plane, which are of independent interest.
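To make the two range queries concrete, the following is a minimal brute-force sketch in Python. It is not the data structure proposed in the paper: it rescans the text for every query and serves only as a reference for what the queries return. The function names, the 0-based inclusive indexing of $T[a..b]$, the requirement that an occurrence lie entirely inside $T[a..b]$, the non-empty pattern, and the definition of the distance between consecutive occurrences at positions $i < j$ as $j - i$ are all assumptions made for illustration.

```python
def occurrences(text, pattern, a, b):
    """Start positions of pattern that lie entirely inside text[a..b].

    Assumes 0-based, inclusive a and b with 0 <= a, and a non-empty pattern.
    """
    m = len(pattern)
    result = []
    i = text.find(pattern, a)
    while i != -1 and i + m - 1 <= b:
        result.append(i)
        i = text.find(pattern, i + 1)  # allow overlapping occurrences
    return result


def consecutive_pairs(text, pattern, a, b):
    """Pairs (i, j) of consecutive occurrences inside text[a..b]."""
    occ = occurrences(text, pattern, a, b)
    return list(zip(occ, occ[1:]))


def range_top_k(text, pattern, a, b, k):
    """The k closest consecutive pairs, in non-descending order of distance.

    Ties are broken by the left position (an arbitrary illustrative choice).
    """
    pairs = consecutive_pairs(text, pattern, a, b)
    return sorted(pairs, key=lambda p: (p[1] - p[0], p[0]))[:k]


def range_gap_bounded(text, pattern, a, b, g1, g2):
    """Consecutive pairs whose distance lies in [g1, g2]."""
    return [(i, j) for i, j in consecutive_pairs(text, pattern, a, b)
            if g1 <= j - i <= g2]


if __name__ == "__main__":
    T = "abaababaab"
    print(range_top_k(T, "ab", 0, 9, 2))
    print(range_gap_bounded(T, "ab", 0, 9, 3, 3))
```

Each query in this baseline costs time proportional to the scanned text plus a sort, whereas the structures described above answer the same queries in $O(|P| + \log\log n + k)$ and $O(|P| + \log\log n + \mathit{output})$ time after preprocessing.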
