Sorting is a fundamental algorithmic pre-processing technique that often allows data to be represented more compactly while speeding up search queries on it. In this paper, we focus on the well-studied problem of sorting and indexing string sets. Since the introduction of suffix trees in 1973, dozens of suffix sorting algorithms have been described in the literature. In 2017, these techniques were extended to sets of strings described by means of finite automata: the theory of Wheeler graphs [Gagie et al., TCS'17] introduced automata whose states can be totally sorted according to the co-lexicographic (co-lex in the following) order of the prefixes of words accepted by the automaton. More recently, [Cotumaccio, Prezza, SODA'21] showed how to extend these ideas to arbitrary automata by means of partial co-lex orders. That work showed that a co-lex order of minimum width (thus optimizing search query times) on deterministic finite automata (DFAs) can be computed in $O(m^2 + n^{5/2})$ time, where $m$ is the number of transitions and $n$ the number of states of the input DFA. In this paper, we exhibit new combinatorial properties of the minimum-width co-lex order of DFAs and exploit them to design faster prefix sorting algorithms. In particular, we describe two algorithms that sort arbitrary DFAs in $O(mn)$ and $O(n^2\log n)$ time, respectively, and an algorithm that sorts acyclic DFAs in $O(m\log n)$ time. Within these running times, all algorithms also compute a smallest chain partition of the partial order (required to index the DFA). We present experimental results showing that an optimized implementation of the $O(n^2\log n)$-time algorithm exhibits nearly linear behaviour on large deterministic pan-genomic graphs and is thus also of practical interest.
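As a minimal illustration of the co-lex order mentioned above, the following Python sketch sorts a small set of plain strings co-lexicographically, that is, by comparing them from their last character backwards. It is not one of the algorithms of this paper: it only shows the order on strings, not on DFA states, and the helper name `colex_key` and the sample words are assumptions made for the example.

```python
def colex_key(word: str) -> str:
    """Sort key for co-lexicographic order: compare strings right to left.

    Hypothetical helper for illustration; reversing the string reduces
    co-lex comparison to ordinary lexicographic comparison.
    """
    return word[::-1]


# Sample words chosen for the example (not taken from the paper).
words = ["ab", "b", "aab", "ba", "a"]

# Words ending in 'a' come before words ending in 'b'; ties are broken
# by the preceding characters, read backwards.
sorted_words = sorted(words, key=colex_key)
print(sorted_words)  # ['a', 'ba', 'b', 'ab', 'aab']
```

On an automaton, the same comparison is applied to the sets of strings reaching each state rather than to single strings, which is why the resulting order may be only partial.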