Logs provide valuable insights into system runtime behavior and assist in software development and maintenance. Log parsing, which converts semi-structured log data into structured log data, is often the first step in automated log analysis. Given the wide range of log parsers built on diverse techniques, it is essential to evaluate them to understand their characteristics and performance. In this paper, we conduct a comprehensive empirical study comparing syntax-based and semantic-based log parsers, as well as single-phase and two-phase parsing architectures. Our experiments reveal that semantic-based methods are better at identifying the correct templates, whereas syntax-based log parsers are 10 to 1,000 times more efficient and provide better grouping accuracy, although they fall short in accurate template identification. Moreover, the two-phase architecture consistently improves accuracy compared to the single-phase architecture. Based on these findings, we propose SynLog+, a template identification module that acts as the second phase in a two-phase log parsing architecture. SynLog+ improves the parsing accuracy of syntax-based and semantic-based log parsers by 236% and 20% on average, respectively, with virtually no additional runtime cost.
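To make the two-phase idea concrete, the following is a minimal illustrative sketch, not SynLog+ or any parser evaluated in the study. It assumes a deliberately naive syntax-based grouping rule (same token count and same first token) for the first phase, and a position-wise comparison for the second, template identification phase. The function names, the `<*>` wildcard, and the sample log lines are all hypothetical.

```python
from collections import defaultdict

WILDCARD = "<*>"  # assumed placeholder for variable parts of a message


def group_logs(messages):
    """Phase 1 (grouping): bucket messages by token count and first token.

    This is a simplistic stand-in for a syntax-based grouping step.
    """
    groups = defaultdict(list)
    for msg in messages:
        tokens = msg.split()
        key = (len(tokens), tokens[0] if tokens else "")
        groups[key].append(tokens)
    return list(groups.values())


def identify_template(group):
    """Phase 2 (template identification): keep tokens that are identical
    across the whole group and replace varying positions with a wildcard."""
    template = []
    for column in zip(*group):  # iterate position by position
        template.append(column[0] if len(set(column)) == 1 else WILDCARD)
    return " ".join(template)


def parse(messages):
    """Run both phases and return the sorted list of templates."""
    return sorted(identify_template(g) for g in group_logs(messages))


# Hypothetical semi-structured log messages.
logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Disk usage at 91 percent",
    "Disk usage at 75 percent",
]
templates = parse(logs)
print(templates)
```

Separating the two phases means the template identification step can be swapped out without touching the grouping step, which is the role the abstract describes for a second-phase module.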