Deterministic finite automata (DFAs) are a classic tool for high-throughput matching of regular expressions, both in theory and in practice. Because DFAs consume a large amount of space, extensive research has been devoted to compressed representations that still support efficient pattern matching queries. Kumar~et~al.~[SIGCOMM 2006] introduced the \emph{delayed deterministic finite automaton} (\ddfa{}), which exploits the large redundancy between inter-state transitions in the automaton. They showed that it compresses real-world DFAs by up to two orders of magnitude, and their work formed the basis of numerous subsequent results. Their algorithm, as well as later algorithms based on their idea, has an inherent quadratic-time bottleneck, since these algorithms consider every pair of states to compute the optimal compression. In this work we present a simple, general framework based on locality-sensitive hashing for speeding up these algorithms to achieve sub-quadratic construction times for \ddfa{}s. We apply the framework to speed up several algorithms to near-linear time, and experimentally evaluate their performance on real-world regular expression sets extracted from modern intrusion detection systems. We find an order of magnitude improvement in compression times, with little or no loss of compression, and in some cases even significantly better compression.