Linear Time Construction of Cover Suffix Tree and Applications

The Cover Suffix Tree (CST) of a string $T$ is the suffix tree of $T$ with additional explicit nodes corresponding to halves of square substrings of $T$. In the CST an explicit node corresponding to a substring $C$ of $T$ is annotated with two numbers: the number of non-overlapping consecutive occurrences of $C$ and the total number of positions in $T$ that are covered by occurrences of $C$ in $T$. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-$n$ string in $O(n \log n)$ time. We show how to compute the CST in $O(n)$ time assuming that $T$ is over an integer alphabet. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that knowing the CST of a length-$n$ string $T$, one can compute a linear-sized representation of all seeds of $T$ as well as all shortest $\alpha$-partial covers and seeds in $T$ for a given $\alpha$ in $O(n)$ time. Thus our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020). Kociumaka et al. (Algorithmica, 2015) proposed an $O(n \log n)$-time algorithm for computing a shortest $\alpha$-partial cover for each $\alpha=1,\ldots,n$; we improve this complexity to $O(n)$. Our results are based on a new characterization of consecutive overlapping occurrences of a substring $S$ of $T$ in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in $T$. This new insight also leads to an $O(n)$-sized index for reporting overlapping consecutive occurrences of a given pattern $P$ of length $m$ in $O(m+output)$ time, where $output$ is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses $O(n \log n)$ space.
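As a rough illustration of the two node annotations described above, the following brute-force sketch computes both quantities for a single substring $C$ of $T$. It is not the paper's linear-time construction, and it builds no suffix tree. The function names are hypothetical, and reading "non-overlapping consecutive occurrences" as "pairs of consecutive occurrences whose starting positions are at least $|C|$ apart" is an assumption made here for illustration.

```python
def occurrences(text, pattern):
    """Return all 0-based start indices of pattern in text (overlaps allowed)."""
    starts = []
    i = text.find(pattern)
    while i != -1:
        starts.append(i)
        i = text.find(pattern, i + 1)
    return starts


def covered_positions(text, pattern):
    """Count positions of text that lie inside at least one occurrence."""
    covered = 0
    end_of_covered = 0  # first index not yet counted as covered
    for start in occurrences(text, pattern):
        stop = start + len(pattern)
        # Only count the part of this occurrence not already covered.
        covered += stop - max(start, end_of_covered)
        end_of_covered = stop
    return covered


def non_overlapping_consecutive(text, pattern):
    """Count consecutive occurrence pairs that do not overlap
    (assumed interpretation: start distance >= len(pattern))."""
    starts = occurrences(text, pattern)
    return sum(1 for a, b in zip(starts, starts[1:]) if b - a >= len(pattern))


if __name__ == "__main__":
    T, C = "abaababa", "aba"
    print(occurrences(T, C))                  # start positions of C in T
    print(covered_positions(T, C))            # positions covered by C
    print(non_overlapping_consecutive(T, C))  # non-overlapping consecutive pairs
```

Under this reading, $C$ is an $\alpha$-partial cover of $T$ exactly when `covered_positions(T, C)` is at least $\alpha$. The sketch takes time proportional to a naive pattern search per substring, whereas the CST stores such values for all explicit nodes at once.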

