We revisit the popular \emph{delayed deterministic finite automaton} (\ddfa{}) compression algorithm introduced by Kumar~et~al.~[SIGCOMM 2006] for compressing deterministic finite automata (DFAs) used in intrusion detection systems. This compression scheme exploits similarities between the sets of outgoing transitions of different states to achieve strong compression while maintaining high matching throughput. Unfortunately, the \ddfa{} algorithm and its later variants require at least quadratic compression time, since they compare all pairs of states to compute an optimal compression. This is too slow, and in some cases infeasible, for the collections of regular expressions in modern intrusion detection systems, which produce DFAs with millions of states. Our main result is a simple, general framework, based on locality-sensitive hashing, that constructs an approximation of the optimal \ddfa{} in near-linear time. We apply our approach to the original \ddfa{} compression algorithm and two important variants, and we experimentally evaluate our algorithms on DFAs from widely used modern intrusion detection systems. Overall, our new algorithms compress up to an order of magnitude faster than existing solutions with little or no loss in compression. Consequently, our algorithms are significantly more scalable and can handle larger collections of regular expressions than previous solutions.
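The idea of replacing the all-pairs comparison with hashing can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes a dense transition table `delta` (one row per state, one column per symbol), uses plain bit sampling as the locality-sensitive hash, and picks default transitions greedily instead of building a spanning tree. All function names and parameters (`lsh_candidate_pairs`, `choose_defaults`, `num_tables`, `sample_size`) are hypothetical.

```python
import random
from collections import defaultdict


def shared_transitions(row_a, row_b):
    """Count the symbols on which two states go to the same target state."""
    return sum(1 for x, y in zip(row_a, row_b) if x == y)


def lsh_candidate_pairs(delta, num_tables=4, sample_size=2, seed=0):
    """Return pairs of states whose transition rows collide in some hash table.

    Each table hashes a row by its targets on a few randomly sampled symbols
    (bit sampling). Rows that agree on many symbols are likely to collide in
    at least one table, so only those pairs need to be compared.
    """
    rng = random.Random(seed)
    alphabet_size = len(delta[0])
    pairs = set()
    for _ in range(num_tables):
        positions = rng.sample(range(alphabet_size), sample_size)
        buckets = defaultdict(list)
        for state, row in enumerate(delta):
            buckets[tuple(row[p] for p in positions)].append(state)
        # Assumption: buckets stay small. A practical version would cap the
        # bucket size, otherwise one huge bucket is quadratic again.
        for states in buckets.values():
            for i in range(len(states)):
                for j in range(i + 1, len(states)):
                    pairs.add((states[i], states[j]))  # states[i] < states[j]
    return pairs


def choose_defaults(delta, pairs):
    """Greedily give each state the best lower-numbered default target.

    Pointing only to lower-numbered states keeps default chains acyclic.
    This is a simplification made for the sketch.
    """
    best = {}
    for a, b in pairs:
        saved = shared_transitions(delta[a], delta[b])
        if saved > best.get(b, (0, None))[0]:
            best[b] = (saved, a)
    return {state: target for state, (_, target) in best.items()}


if __name__ == "__main__":
    # States 0 and 1 have identical rows; state 2 differs on every symbol.
    delta = [[1, 2, 3, 0], [1, 2, 3, 0], [0, 0, 0, 1]]
    pairs = lsh_candidate_pairs(delta)
    print(pairs, choose_defaults(delta, pairs))
```

In this toy table, states 0 and 1 collide in every hash table because their rows are identical, so state 1 can store a single default transition to state 0 in place of its four labeled transitions. State 2 never collides with either and keeps its full row.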