A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We?

Log data have facilitated various tasks of software development and maintenance, such as testing, debugging and diagnosing. Due to the unstructured nature of logs, log parsing is typically required to transform log messages into structured data for automated log analysis. Given the abundance of log parsers that employ various techniques, evaluating these tools to comprehend their characteristics and performance becomes imperative. Loghub serves as a commonly used dataset for benchmarking log parsers, but it suffers from limited scale and representativeness, posing significant challenges for studies to comprehensively evaluate existing log parsers or develop new methods. This limitation is particularly pronounced when assessing these log parsers for production use. To address these limitations, we provide a new collection of annotated log datasets, denoted Loghub-2.0, which can better reflect the characteristics of log data in real-world software systems. Loghub-2.0 comprises 14 datasets with an average of 3.6 million log lines in each dataset. Based on Loghub-2.0, we conduct a thorough re-evaluation of 15 state-of-the-art log parsers in a more rigorous and practical setting. Particularly, we introduce a new evaluation metric to mitigate the sensitivity of existing metrics to imbalanced data distributions. We are also the first to investigate the granular performance of log parsers on logs that represent rare system events, offering in-depth details for software diagnosis. Accurately parsing such logs is essential, yet it remains a challenge. We believe this work could shed light on the evaluation and design of log parsers in practical settings, thereby facilitating their deployment in production systems.
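To make the imbalance concern concrete, the following toy sketch contrasts a message-level grouping score with a template-level one. It is an illustration under stated assumptions, not the metric introduced in this work: the function names, the template-level F1 formulation, and the synthetic labels are all hypothetical and chosen only to show how a single frequent template can dominate a message-level score.

```python
from collections import defaultdict


def groups(labels):
    """Return the set of groups, each a frozenset of message indices sharing a label."""
    members = defaultdict(set)
    for index, label in enumerate(labels):
        members[label].add(index)
    return {frozenset(ids) for ids in members.values()}


def message_level_accuracy(truth, predicted):
    """Fraction of messages whose predicted group exactly equals their true group."""
    correct = groups(truth) & groups(predicted)
    return sum(len(group) for group in correct) / len(truth)


def template_level_f1(truth, predicted):
    """F1 over groups: every template counts once, however many messages it has."""
    true_groups, pred_groups = groups(truth), groups(predicted)
    correct = len(true_groups & pred_groups)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_groups)
    recall = correct / len(true_groups)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: one frequent template (96 messages) and four rare ones (1 each).
truth = ["E0"] * 96 + ["E1", "E2", "E3", "E4"]
# Hypothetical parser output: the frequent template is right, the rare ones are merged.
predicted = ["P0"] * 96 + ["P1"] * 4

print(message_level_accuracy(truth, predicted))  # 0.96
print(template_level_f1(truth, predicted))       # 2/7, roughly 0.29
```

In this synthetic case the parser mishandles every rare template, yet the message-level score stays at 0.96 because the frequent template carries 96 of the 100 messages. The template-level score weights each template equally and drops to about 0.29, which is the kind of gap that matters when rare events are the ones of diagnostic interest.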
