We revisit the popular \emph{delayed deterministic finite automaton} (\ddfa{}) compression algorithm introduced by Kumar~et~al.~[SIGCOMM 2006] for compressing deterministic finite automata (DFAs) used in intrusion detection systems. This compression scheme exploits similarities between the sets of outgoing transitions of different states to achieve strong compression while maintaining high matching throughput. Unfortunately, the \ddfa{} algorithm and its later variants require at least quadratic compression time, since they compare all pairs of states to compute an optimal compression. This is too slow, and in some cases infeasible, for the collections of regular expressions in modern intrusion detection systems, which produce DFAs with millions of states. Our main result is a simple, general framework, based on locality-sensitive hashing, that constructs an approximation of the optimal \ddfa{} in near-linear time. We apply our approach to the original \ddfa{} compression algorithm and two important variants, and we experimentally evaluate our algorithms on DFAs from widely used modern intrusion detection systems. Overall, our new algorithms compress up to an order of magnitude faster than existing solutions, with little or no loss in compression. Consequently, our algorithms are significantly more scalable and can handle larger collections of regular expressions than previous solutions.
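To make the idea concrete, the following is a minimal Python sketch of choosing default transitions with locality-sensitive hashing. It is an illustration under stated assumptions, not the algorithm evaluated in the paper: it uses bit-sampling LSH (hashing each state on a few randomly chosen alphabet positions), compares each state only with its predecessor in the same bucket, and keeps the default-transition graph acyclic by only pointing to lower-numbered states. The names `lsh_default_transitions`, `compress`, and `step`, and the parameters `num_tables` and `samples`, are hypothetical.

```python
import random
from collections import defaultdict


def similarity(a, b):
    """Number of symbols on which two states have the same target."""
    return sum(x == y for x, y in zip(a, b))


def lsh_default_transitions(delta, num_tables=4, samples=2, seed=0):
    """Choose a default-transition target per state via bit-sampling LSH.

    delta[s] is a tuple of target states, one per alphabet symbol.
    Assumption: defaults only point to lower-numbered states, which
    trivially keeps the default-transition graph free of cycles.
    """
    rng = random.Random(seed)
    sigma = len(delta[0])
    best = [(0, None)] * len(delta)  # (shared transitions, target) per state
    for _ in range(num_tables):
        positions = rng.sample(range(sigma), samples)
        buckets = defaultdict(list)
        for s, row in enumerate(delta):
            buckets[tuple(row[p] for p in positions)].append(s)
        for members in buckets.values():
            # One comparison per state and table instead of all pairs.
            for prev, cur in zip(members, members[1:]):
                w = similarity(delta[prev], delta[cur])
                if w > best[cur][0]:
                    best[cur] = (w, prev)
    return [target for _, target in best]


def compress(delta, default):
    """Keep only labeled transitions that differ from the default target."""
    labeled = []
    for s, row in enumerate(delta):
        d = default[s]
        labeled.append({c: t for c, t in enumerate(row)
                        if d is None or delta[d][c] != t})
    return labeled


def step(labeled, default, s, c):
    """Follow default transitions, consuming no input, until c is labeled."""
    while c not in labeled[s]:
        s = default[s]
    return labeled[s][c]


if __name__ == "__main__":
    # Toy DFA: 4 states over a 4-symbol alphabet (made-up data).
    delta = [(1, 2, 3, 0), (1, 2, 3, 1), (1, 2, 3, 2), (0, 0, 0, 0)]
    default = lsh_default_transitions(delta)
    labeled = compress(delta, default)
    print("defaults:", default)
    print("stored transitions:", sum(len(m) for m in labeled),
          "of", sum(len(row) for row in delta))
```

The sketch performs one similarity computation per state and hash table, so the total work grows linearly with the number of states for a fixed alphabet and a fixed number of tables. Which states end up sharing a bucket depends on the sampled positions, so the resulting compression is an approximation that varies with the seed.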