We revisit the popular \emph{delayed deterministic finite automaton} (\ddfa{}) compression algorithm introduced by Kumar~et~al.~[SIGCOMM 2006] for compressing deterministic finite automata (DFAs) used in intrusion detection systems. This compression scheme exploits similarities among the sets of outgoing transitions of different states to achieve strong compression while maintaining high matching throughput. Unfortunately, the \ddfa{} algorithm and its later variants require at least quadratic compression time, since they compare all pairs of states to compute an optimal compression. This is too slow, and in some cases infeasible, for the collections of regular expressions in modern intrusion detection systems, which produce DFAs with millions of states. Our main result is a simple, general framework based on locality-sensitive hashing that constructs an approximation of the optimal \ddfa{} in near-linear time. We apply our approach to the original \ddfa{} compression algorithm and two important variants, and we experimentally evaluate our algorithms on DFAs from widely used modern intrusion detection systems. Overall, our new algorithms compress up to an order of magnitude faster than existing solutions with little or no loss in compression. Consequently, our algorithms are significantly more scalable and can handle larger collections of regular expressions than previous solutions.
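To make the idea concrete, the following is a minimal Python sketch of the general principle described above, not the algorithm evaluated in the paper. It assumes a DFA given as a dense table `delta[s][c]`, and it uses a deliberately simple locality-sensitive hash (the targets at a few randomly sampled symbols) to bucket states with similar transition sets, so that only states sharing a bucket are compared. All function names and parameters (`lsh_default_transitions`, `compress`, `step`, `num_samples`, `rounds`) are hypothetical. The rule that a default transition may only point to a lower-numbered state is a simplification chosen here to rule out cycles.

```python
import random
from collections import defaultdict


def lsh_default_transitions(delta, num_samples=2, rounds=4, seed=0):
    """Pick a default state for some states using hash buckets instead of all pairs.

    delta[s][c] is the target of state s on symbol c (dense table).
    Returns a dict mapping a state to its default state.
    """
    rng = random.Random(seed)
    sigma = len(delta[0])
    best = {}  # state -> (number of shared transitions, default state)
    for _ in range(rounds):
        # Hash function for this round: the targets at a few sampled symbols.
        positions = rng.sample(range(sigma), num_samples)
        buckets = defaultdict(list)
        for s, row in enumerate(delta):
            buckets[tuple(row[p] for p in positions)].append(s)
        for states in buckets.values():
            # States were appended in increasing order, so parent < child,
            # which keeps the default pointers acyclic (a simplification).
            for parent, child in zip(states, states[1:]):
                shared = sum(a == b for a, b in zip(delta[parent], delta[child]))
                if shared > best.get(child, (0, None))[0]:
                    best[child] = (shared, parent)
    return {s: parent for s, (_, parent) in best.items()}


def compress(delta, default):
    """Store only the transitions that differ from the default state's."""
    compressed = []
    for s, row in enumerate(delta):
        if s in default:
            d = default[s]
            compressed.append({c: t for c, t in enumerate(row) if delta[d][c] != t})
        else:
            compressed.append(dict(enumerate(row)))  # root: keep every transition
    return compressed


def step(compressed, default, state, symbol):
    """Follow default transitions until a stored transition on symbol is found."""
    while symbol not in compressed[state]:
        state = default[state]
    return compressed[state][symbol]
```

Each round performs a number of comparisons linear in the number of states, rather than comparing all pairs. The price is that a state's most similar partner may be missed if the two never land in the same bucket, which is why the result is only an approximation of the optimal compression.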