A string $w$ is said to be a minimal absent word (MAW) for a string $S$ if $w$ does not occur in $S$ and every proper substring of $w$ occurs in $S$. We focus on non-trivial MAWs, which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size $\Theta(n)$, based on the directed acyclic word graph (DAWG), that can output the set $\mathsf{MAW}(S)$ of all MAWs for a given string $S$ of length $n$ in $O(n + |\mathsf{MAW}(S)|)$ time. In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output $\mathsf{MAW}(S)$ in $O(|\mathsf{MAW}(S)|)$ time with $O(e)$ space, where $e$ denotes the minimum of the sizes of the CDAWGs for $S$ and for its reversal $S^R$. For any string of length $n$, it holds that $e < 2n$, and for highly repetitive strings $e$ can be sublinear (as small as logarithmic) in $n$. We also show, via the CDAWG, that MAWs and their generalization, minimal rare words, are closely related to extended bispecial factors.
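To make the definition concrete, the following is a minimal brute-force sketch in Python that enumerates the non-trivial MAWs of a short string. It is only an illustration of the definition, not the CDAWG-based data structure of the paper, and it uses quadratically many substrings. The function name `minimal_absent_words` is hypothetical, and the alphabet is assumed to be the set of characters occurring in $S$ (any non-trivial MAW can only use such characters, since its length-1 substrings must occur in $S$). The sketch relies on the observation that a word $w = aub$ of length at least 2 is a MAW exactly when $au$ and $ub$ occur in $S$ but $aub$ does not.

```python
def minimal_absent_words(s):
    """Brute-force enumeration of the non-trivial MAWs (length >= 2) of s.

    Illustrative only: builds all O(n^2) substrings explicitly.
    """
    n = len(s)
    # All substrings of s, including the empty string.
    substrings = {s[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    # Assumption: the alphabet is the set of characters occurring in s.
    alphabet = set(s)

    maws = set()
    for prefix in substrings:
        if not prefix:
            continue  # w must have length >= 2, so its prefix a+u is non-empty
        for b in alphabet:
            w = prefix + b
            # w[:-1] = prefix occurs by construction; check that the suffix
            # w[1:] occurs and that w itself is absent.
            if w not in substrings and w[1:] in substrings:
                maws.add(w)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
```

For $S = \texttt{abaab}$, this yields the four non-trivial MAWs $\texttt{aaa}$, $\texttt{aaba}$, $\texttt{bab}$, and $\texttt{bb}$; for instance, $\texttt{aaba}$ is absent from $S$ while its proper substrings $\texttt{aab}$ and $\texttt{aba}$ both occur.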