Preprocessing is All You Need: Boosting the Performance of Log Parsers With a General Preprocessing Framework

Log parsing has long been studied in software engineering because of its importance in identifying dynamic variables and constructing log templates. Prior work has proposed many statistic-based log parsers (e.g., Drain) that are highly efficient; unfortunately, their parsing performance has hit a bottleneck compared with semantic-based log parsers, which require labeling and more computational resources. Meanwhile, we noticed that previous studies focused mainly on parsing and often treated preprocessing as an ad hoc step (e.g., masking numbers). We argue that both preprocessing and parsing are essential for log parsers to identify dynamic variables, and that a limited understanding of preprocessing may hinder both the optimal use of parsers and future research. We therefore studied existing log preprocessing approaches on Loghub, a popular log parsing benchmark. Based on our findings, we developed a general preprocessing framework and evaluated its impact on existing parsers. Our experiments show that the framework significantly boosts the performance of four state-of-the-art statistic-based parsers. Drain, the best statistic-based parser, improved on all four parsing metrics (e.g., its F1 score of template accuracy, FTA, increased by 108.9%). Compared with semantic-based parsers, it achieved a 28.3% improvement in grouping accuracy (GA), a 38.1% improvement in FGA, and an 18.6% improvement in FTA. Our work pioneers the study of log preprocessing and provides a generalizable framework to enhance log parsing.
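To make the notion of ad hoc preprocessing concrete, the sketch below shows the kind of regex-based masking step (e.g., masking numbers) that is commonly applied before a statistic-based parser runs. It is only an illustration: the patterns, the `<*>` placeholder, and the function name `mask_variables` are assumptions chosen for this example, not the preprocessing framework evaluated in this work.

```python
import re

# Hypothetical masking rules (assumed for illustration, not taken from the paper).
# Order matters: more specific patterns run before the generic number pattern.
MASK_PATTERNS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),  # IPv4 address, optional port
    re.compile(r"\b0x[0-9a-fA-F]+\b"),                     # hexadecimal literal
    re.compile(r"\b\d+\b"),                                # standalone decimal number
]


def mask_variables(line: str) -> str:
    """Replace likely dynamic variables in a log line with the placeholder <*>."""
    for pattern in MASK_PATTERNS:
        line = pattern.sub("<*>", line)
    return line


if __name__ == "__main__":
    print(mask_variables("Connection from 10.0.0.5:8080 closed after 35 ms"))
    print(mask_variables("Deleting block blk_123"))
```

In this sketch, a token such as `blk_123` is left untouched because the digits are not a standalone word, which hints at why a fixed set of ad hoc rules can miss dynamic variables.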
