A string $w$ is said to be a minimal absent word (MAW) for a string $S$ if $w$ does not occur in $S$ and every proper substring of $w$ occurs in $S$. We focus on non-trivial MAWs, which are those of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size $\Theta(n)$, based on the directed acyclic word graph (DAWG), that can output the set $\mathsf{MAW}(S)$ of all MAWs for a given string $S$ of length $n$ in $O(n + |\mathsf{MAW}(S)|)$ time. In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output $\mathsf{MAW}(S)$ in $O(|\mathsf{MAW}(S)|)$ time with $O(\mathsf{e}_\min)$ space, where $\mathsf{e}_\min$ denotes the smaller of the sizes of the CDAWGs for $S$ and for its reversal $S^R$. For any string of length $n$, it holds that $\mathsf{e}_\min < 2n$, and for highly repetitive strings $\mathsf{e}_\min$ can be sublinear in $n$, as small as logarithmic. We also show, via the CDAWG, that MAWs and their generalization, minimal rare words, are closely related to extended bispecial factors.
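To make the definition concrete, the following is a minimal brute-force sketch in Python that enumerates the non-trivial MAWs of a short string. It is not the CDAWG-based data structure of this paper, and it is far from the stated time and space bounds; the function name `minimal_absent_words` is a hypothetical name chosen for illustration. It relies on the observation that a word $w$ of length at least 2 is a MAW exactly when $w$ is absent from $S$ while both $w$ without its last character and $w$ without its first character occur in $S$, since every other proper substring of $w$ is contained in one of those two.

```python
def minimal_absent_words(s: str) -> set[str]:
    """Return all MAWs of length >= 2 for s (brute force, illustration only)."""
    n = len(s)
    # All substrings of s, including the empty string.
    substrings = {s[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    # Every character of a non-trivial MAW must itself occur in s.
    alphabet = set(s)

    maws = set()
    for prefix in substrings:
        if not prefix:
            continue  # skip the empty prefix so candidates have length >= 2
        for c in alphabet:
            w = prefix + c
            # w is absent, its longest proper prefix occurs (by construction),
            # and its longest proper suffix occurs.
            if w not in substrings and w[1:] in substrings:
                maws.add(w)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
```

For $S = \texttt{abaab}$ this yields the four words `aaa`, `aaba`, `bab`, and `bb`. Because the sketch materializes all substrings of $S$, it is only suitable for checking small examples by hand.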