We present the design and architecture of REI, a novel system for indexing log data to support regular expression queries. Our main contribution is an $n$-gram-based indexing strategy combined with an efficient storage mechanism, which together yield a speedup of up to 14x over state-of-the-art regex processing engines that do not use indexing, while requiring only 2.1% extra space. We study in detail the space usage of the index and the improvement in workload execution time. In particular, we show that even an optimized implementation of strategies such as inverted indexing, which are widely used in text processing libraries, may lead to suboptimal performance for regex indexing on log analysis tasks. Overall, the REI approach provides a significant boost when evaluating regular expression queries on log data. REI is also modular and works with existing regular expression packages, making it easy to deploy in a variety of settings. The code of REI is available at https://github.com/mush-zhang/REI-Regular-Expression-Indexing.
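To make the general idea of $n$-gram-based filtering concrete, the following is a minimal sketch and not REI's actual implementation. It assumes a plain in-memory mapping from each $n$-gram to the set of log lines containing it, and it assumes the caller supplies a literal substring that every match of the regex must contain. The function names (`build_index`, `search`) and the `required_literal` parameter are hypothetical and chosen only for illustration. The index narrows the set of candidate lines, and an existing regex package (here Python's `re`) then verifies each candidate.

```python
import re
from collections import defaultdict


def ngrams(text, n):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(lines, n=3):
    """Map each n-gram to the set of line numbers that contain it.

    Assumption: a simple dict-of-sets layout, used here only for clarity.
    """
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for gram in ngrams(line, n):
            index[gram].add(line_no)
    return index


def search(lines, index, pattern, required_literal, n=3):
    """Return line numbers matching `pattern`, using the index as a prefilter.

    `required_literal` is a hypothetical parameter: a substring that the
    caller knows must occur in every match of `pattern`.
    """
    grams = ngrams(required_literal, n)
    if not grams:
        # Literal shorter than n: no filtering possible, scan every line.
        candidates = range(len(lines))
    else:
        # A line can match only if it contains every n-gram of the literal.
        candidate_set = set(range(len(lines)))
        for gram in grams:
            candidate_set &= index.get(gram, set())
        candidates = sorted(candidate_set)
    regex = re.compile(pattern)
    # The regex engine has the final say; the index only prunes candidates.
    return [i for i in candidates if regex.search(lines[i])]
```

For instance, searching for `ERROR .*user=\w+` with the required literal `"ERROR "` only runs the regex on lines that contain all of the trigrams `ERR`, `RRO`, `ROR`, and `OR `. Because the filter can only discard lines that cannot match, the result equals that of a full scan.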