String Indexing for Top-$k$ Close Consecutive Occurrences

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$; its distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the $k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ plus $k$. We give three time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$. Our second and third results achieve linear space and query times of either $O(m+k^{1+\epsilon})$ or $O(m + k \log^{1+\epsilon} n)$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
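
To make the query definition concrete, here is a minimal brute-force sketch in Python. It is not one of the data structures from the paper: it scans $S$ on every query, taking time roughly proportional to $n \cdot m$ in the worst case, rather than answering from a precomputed index. The function names, the 0-indexed positions, and the tie-breaking rule (smaller $i$ first among equal distances) are assumptions made for illustration only.

```python
import heapq


def occurrences(S, P):
    """Return all starting positions of P in S (0-indexed, overlaps allowed)."""
    positions = []
    start = S.find(P)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are found too.
        start = S.find(P, start + 1)
    return positions


def top_k_close_consecutive(S, P, k):
    """Return up to k consecutive occurrences (i, j) of P in S with smallest j - i.

    Assumption for illustration: ties in distance are broken by the smaller i.
    """
    occ = occurrences(S, P)
    # Adjacent entries of the sorted occurrence list are exactly the pairs
    # (i, j) with no occurrence of P strictly between i and j.
    pairs = list(zip(occ, occ[1:]))
    return heapq.nsmallest(k, pairs, key=lambda p: (p[1] - p[0], p[0]))
```

For example, `top_k_close_consecutive("abaababaab", "ab", 2)` considers the occurrences at positions 0, 3, 5, and 8, which give the consecutive pairs `(0, 3)`, `(3, 5)`, and `(5, 8)` with distances 3, 2, and 3, and returns the two closest under the assumed tie-breaking rule.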

