Nearly Optimal Internal Dictionary Matching

We study the internal dictionary matching (IDM) problem, in which a dictionary $\mathcal{D}$ containing $d$ substrings of a text $T$ is given and each query concerns the occurrences of patterns from $\mathcal{D}$ in another substring of $T$. We propose a novel $O(n)$-sized data structure named the Basic Substring Structure (BASS), where $n$ is the length of the text $T$. With BASS, we can handle all types of queries in the IDM problem in nearly optimal query and preprocessing time. Specifically, our results include:

- The first algorithm that answers a *CountDistinct* query, which asks for the number of distinct patterns that occur in $T[i..j]$, in $\tilde{O}(1)$ time after $\tilde{O}(n+d)$ preprocessing. Previously, the best result was $\tilde{O}(m)$ time per query after $\tilde{O}(n^2/m+d)$ or $\tilde{O}(nd/m+d)$ preprocessing, where $m$ is a chosen parameter.
- Faster algorithms for two other types of internal queries. \textbf{(1)} For pattern counting (Count) queries, we improve the runtime to $O(\log n/\log\log n)$ time per query with $O(n+d\sqrt{\log n})$ preprocessing, from $O(\log^2 n/\log\log n)$ time per query with $O(n\log n/\log \log n+d\log^{3/2} n)$ preprocessing. \textbf{(2)} For distinct pattern reporting (ReportDistinct) queries, we improve the runtime to $O(1+|\text{output}|)$ time per query, from $O(\log n+|\text{output}|)$ time per query.

In addition, we match the optimal runtime for the remaining two query types, pattern existence (Exist) and pattern reporting (Report). We also show that BASS is more generally applicable to other internal query problems.
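
To make the five query types concrete, the following is a minimal brute-force sketch of their semantics only. It is not BASS and has none of the stated time bounds. The function names, the 0-based inclusive indexing, and the representation of $\mathcal{D}$ as a list of `(start, end)` index pairs into $T$ are assumptions made for illustration.

```python
def occurrences(text, pattern, i, j):
    """Start positions of `pattern` that lie entirely inside text[i..j] (0-based, inclusive)."""
    window = text[i:j + 1]
    m = len(pattern)
    return [i + k for k in range(len(window) - m + 1) if window.startswith(pattern, k)]


def idm_queries(text, dictionary, i, j):
    """Naive reference answers for all five IDM query types on text[i..j].

    `dictionary` is a list of (start, end) pairs, each naming the substring
    text[start..end] as a pattern (hypothetical representation).
    """
    patterns = [text[a:b + 1] for a, b in dictionary]
    # Report: every (pattern, start position) occurrence inside the window.
    report = [(pat, p) for pat in patterns for p in occurrences(text, pat, i, j)]
    # ReportDistinct: each pattern that occurs at least once, listed once.
    report_distinct = sorted({pat for pat, _ in report})
    return {
        "Exist": bool(report),
        "Report": report,
        "Count": len(report),
        "ReportDistinct": report_distinct,
        "CountDistinct": len(report_distinct),
    }


if __name__ == "__main__":
    T = "abaab"
    D = [(0, 0), (0, 1), (1, 2)]  # the patterns "a", "ab", "ba"
    print(idm_queries(T, D, 1, 4))  # queries on the substring "baab"
```

A baseline like this costs time proportional to the window length times the dictionary size per query, so its practical use is as a correctness oracle when testing a faster structure.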
