We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among the strings of a repetitive collection $\mathcal{T}$. The key concept in our work is the construction of a fully-balanced grammar $\mathcal{G}$ from $\mathcal{T}$ that meets a property we call \emph{fix-free}: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., a set that is both prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of $\mathcal{T}$ incrementally over $\mathcal{G}$ using a standard suffix-tree-based MEM algorithm, which runs on a subset of the grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how to build $\mathcal{G}$ from $\mathcal{T}$ in linear time and space. We also demonstrate that our MEM algorithm runs on top of $\mathcal{G}$ in $O(G + occ)$ time and uses $O((G + occ)\log G)$ bits, where $G$ is the grammar size and $occ$ is the number of MEMs in $\mathcal{T}$. In the conclusions, we discuss how our idea can be modified to implement approximate pattern matching in compressed space.
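To make the fix-free definition concrete, the following minimal Python sketch checks whether a plain set of strings is both prefix-free and suffix-free. It illustrates the definition only and is not the grammar construction described above: in the paper the property applies to the expansions of nonterminals of equal height, whereas this example takes explicit strings. The function names `is_prefix_free` and `is_fix_free` are hypothetical and chosen for this illustration.

```python
def is_prefix_free(strings):
    """Return True if no string in the collection is a proper prefix of another."""
    ordered = sorted(set(strings))
    # In lexicographic order, a string that is a prefix of any other string
    # is also a prefix of its immediate successor, so adjacent pairs suffice.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def is_fix_free(strings):
    """Return True if the collection is both prefix-free and suffix-free."""
    # A set is suffix-free exactly when the set of reversed strings is prefix-free.
    reversed_strings = [s[::-1] for s in strings]
    return is_prefix_free(strings) and is_prefix_free(reversed_strings)


if __name__ == "__main__":
    print(is_fix_free(["ab", "ba", "cc"]))  # no string is a prefix or suffix of another
    print(is_fix_free(["ab", "abc"]))       # "ab" is a prefix of "abc"
    print(is_fix_free(["ab", "cab"]))       # "ab" is a suffix of "cab"
```

Sorting keeps the check at $O(n \log n)$ string comparisons instead of testing every pair, which is a design choice of this sketch rather than something taken from the paper.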