Internal Pattern Matching Queries in a Text and Applications

We consider several types of internal queries, that is, questions about fragments of a given text $T$ specified in constant space by their locations in $T$. Our main result is an optimal data structure for Internal Pattern Matching (IPM) queries which, given two fragments $x$ and $y$, ask for a representation of all fragments contained in $y$ and matching $x$ exactly; this problem can be viewed as an internal version of the Exact Pattern Matching problem. Our data structure answers IPM queries in time proportional to the quotient $|y|/|x|$ of fragments' lengths, which is required due to the information content of the output. If $T$ is a text of length $n$ over an integer alphabet of size $\sigma$, then our data structure occupies $O(n/ \log_\sigma n)$ machine words (that is, $O(n\log \sigma)$ bits) and admits an $O(n/ \log_\sigma n)$-time construction algorithm. We show the applicability of IPM queries for answering internal queries corresponding to other classic string processing problems. Among others, we derive optimal data structures reporting the periods of a fragment and testing the cyclic equivalence of two fragments. IPM queries have already found numerous further applications, following the path paved by the classic Longest Common Extension (LCE) queries of Landau and Vishkin (JCSS, 1988). In particular, IPM queries have been implemented in grammar-compressed and dynamic settings and, along with LCE queries, constitute elementary operations of the PILLAR model, developed by Charalampopoulos, Kociumaka, and Wellnitz (FOCS 2020). On the way to our main result, we provide a novel construction of string synchronizing sets of Kempa and Kociumaka (STOC 2019). Our method, based on a new restricted version of the recompression technique of Jeż (J. ACM, 2016), yields a hierarchy of $O(\log n)$ string synchronizing sets covering the whole spectrum of fragments' lengths.
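To make the semantics of the three internal queries named in the abstract concrete, the following brute-force Python sketch may help. It is only a reference for what the queries return, not the data structure described above: it runs in time polynomial in the fragment lengths rather than $O(|y|/|x|)$. The function names and the convention of passing a fragment as a half-open index pair `(start, end)` into the text are assumptions made for this illustration.

```python
def ipm_naive(text, x, y):
    """Brute-force IPM query (illustrative only).

    x and y are fragments of `text` given as half-open index pairs.
    Returns the starting positions in `text` of all occurrences of
    text[x] that lie entirely inside text[y].
    """
    (i, j), (a, b) = x, y
    pattern = text[i:j]
    m = j - i
    # An occurrence starting at s occupies text[s:s+m] and must end by b.
    return [s for s in range(a, b - m + 1) if text[s:s + m] == pattern]


def periods_naive(text, frag):
    """All periods p of the fragment w: w[k] == w[k + p] wherever defined."""
    i, j = frag
    w = text[i:j]
    n = len(w)
    return [p for p in range(1, n + 1) if w[p:] == w[:n - p]]


def cyclically_equivalent_naive(text, x, y):
    """True if one fragment is a rotation of the other."""
    u = text[x[0]:x[1]]
    v = text[y[0]:y[1]]
    # v is a rotation of u exactly when |u| == |v| and v occurs in uu.
    return len(u) == len(v) and v in u + u


if __name__ == "__main__":
    t = "abababab"
    print(ipm_naive(t, (0, 2), (1, 8)))   # occurrences of "ab" inside "bababab"
    print(periods_naive(t, (0, 7)))       # periods of "abababa"
    print(cyclically_equivalent_naive("abcbca", (0, 3), (3, 6)))
```

In the first call the reported positions are evenly spaced. This reflects the classical fact that the occurrences of $x$ inside a fragment of length less than $2|x|$ form an arithmetic progression, which is why an output of size proportional to $|y|/|x|$ can represent all occurrences.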

