In this paper, we present the first study of the computational complexity of converting an automata-based text index structure, the Compact Directed Acyclic Word Graph (CDAWG) of size $e$ for a text $T$ of length $n$, into other text indexing structures for the same text that are suitable for highly repetitive texts: the run-length BWT of size $r$, the irreducible PLCP array of size $r$, the quasi-irreducible LPF array of size $e$, the lex-parse of size $O(r)$, and the LZ77-parse of size $z$, where $r, z \le e$. As our main results, we show that all of these structures can be optimally computed in $O(e)$ worst-case time and $O(e)$ words of working space from either the CDAWG for $T$ stored in read-only memory or its self-index version of size $e$, which does not store the text. To obtain these results, we devise techniques for enumerating a particular subset of suffixes in lexicographic order and in text order using forward and backward search on the CDAWG, extending the results of Belazzougui et al. (2015).
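To make the size parameters $r$ and $z$ concrete, the following minimal Python sketch computes them naively for a short text. It is an illustration only and is not the paper's CDAWG-based conversion: the function names `bwt`, `count_runs`, and `lz77_parse` are hypothetical, the `$` sentinel is an assumption, and the LZ77 variant shown (greedy, self-overlapping phrases, single fresh characters) is one common definition that may differ in detail from the one used in the paper.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in text and is smaller than every
    character of text. Quadratic space; for illustration only.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (r when s is the BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def lz77_parse(text: str) -> list[str]:
    """Greedy LZ77 factorization by brute force (assumed variant).

    Each phrase is the longest prefix of the remaining text that also
    starts at an earlier position (overlap allowed), or a single
    character if no such prefix exists.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        longest = 0
        for j in range(i):  # try every earlier starting position
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        step = max(longest, 1)  # a fresh character forms its own phrase
        phrases.append(text[i:i + step])
        i += step
    return phrases


if __name__ == "__main__":
    t = "abababab"
    print("BWT:", bwt(t), "r =", count_runs(bwt(t)))
    print("LZ77 phrases:", lz77_parse(t), "z =", len(lz77_parse(t)))
```

On a highly repetitive input such as `abababab`, both quantities stay small relative to the text length, which is the regime the structures above are designed for.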