Sorting Finite Automata via Partition Refinement

Wheeler nondeterministic finite automata (WNFAs) were introduced as a generalization of prefix sorting from strings to labeled graphs. WNFAs admit optimal solutions to classic hard problems on labeled graphs and languages. The problem of deciding whether a given NFA is Wheeler is known to be NP-complete. Recently, however, Alanko et al. showed how to side-step this complexity by switching to preorders: letting $Q$ be the set of states, $E$ the set of transitions, $|Q|=n$, and $|E|=m$, they provided an $O(mn^2)$-time algorithm computing a totally-ordered partition of the WNFA's states such that (1) equivalent states recognize the same regular language, and (2) the order of non-equivalent states is consistent with any Wheeler order, when one exists. The output is therefore a preorder of the states that is as useful for pattern matching as a standard Wheeler order. Further research generalized these concepts to arbitrary NFAs by introducing co-lex partial preorders: any NFA admits a partial preorder of its states reflecting the co-lex order of their accepted strings, and the smaller the width of such a preorder, the faster regular expression matching queries can be performed. To date, the fastest algorithm for computing the smallest-width partial preorder on NFAs runs in $O(m^2+n^{5/2})$ time, while on DFAs the same can be done in $O(\min(n^2\log n,mn))$ time. In this paper, we provide much more efficient solutions to these problems. Our results are achieved by extending a classic algorithm for the relational coarsest partition refinement problem to work with ordered partitions. Specifically, we provide an $O(m\log n)$-time algorithm computing a co-lex total preorder when the input is a WNFA, and an algorithm with the same time complexity computing the smallest-width co-lex partial order of any DFA. We also present implementations of our algorithms and show that they are very efficient in practice.
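
For readers unfamiliar with the relational coarsest partition refinement problem mentioned above, the following is a minimal sketch of its plain, unordered version. It is not the algorithm of this paper: it is a naive fixpoint loop rather than an $O(m\log n)$ procedure, it ignores edge labels, and it does not maintain any order among the blocks. The function name `coarsest_stable_refinement` and its parameters are assumptions made for illustration only.

```python
def coarsest_stable_refinement(states, edges, initial_partition):
    """Naively refine initial_partition until it is stable w.r.t. edges.

    A partition is stable when, for every pair of blocks (B, S), either
    every state of B has an edge into S or no state of B does.
    Illustrative only: unlabeled edges, unordered blocks, no attempt at
    the O(m log n) bound.
    """
    # preds[v] = set of states u with an edge u -> v
    preds = {q: set() for q in states}
    for u, v in edges:
        preds[v].add(u)

    blocks = [frozenset(b) for b in initial_partition]
    changed = True
    while changed:
        changed = False
        for splitter in blocks:
            # States with at least one edge into the splitter block.
            pre = set().union(*(preds[v] for v in splitter))
            refined = []
            for b in blocks:
                inside, outside = b & pre, b - pre
                if inside and outside:
                    # b is not stable w.r.t. splitter: split it in two.
                    refined += [inside, outside]
                    changed = True
                else:
                    refined.append(b)
            if changed:
                blocks = refined
                break  # restart the scan on the refined partition
    return set(blocks)
```

On a small graph with edges 0 -> 2 and 1 -> 2 and a single initial block, the sketch separates state 2 (which has no outgoing edge) from states 0 and 1, which remain together because they cannot be distinguished by their successors.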
