Modern software systems are so large and complex that they generate a tremendous volume of logs. Manually investigating these data in a reasonable time is infeasible, so log analysis must be automated to derive insights about how the systems function. Motivated by an industry use case, we zoom in on one integral part of automated log analysis, log parsing, which is the prerequisite to deriving any insights from logs. Our investigation reveals problematic aspects within the log parsing field, particularly its inefficiency in handling heterogeneous real-world logs. We show this by assessing the 14 most-recognized log parsing approaches in the literature on (i) nine publicly available datasets, (ii) one dataset composed of combined publicly available data, and (iii) one dataset generated within the infrastructure of a large bank. Subsequently, to improve log parsing robustness in real-world production scenarios, we propose a tool, Logchimera, that enables estimating log parsing performance in industry contexts by generating synthetic log data that resemble industry logs. Our contributions serve as a foundation to consolidate past research efforts, facilitate future research advancements, and establish a strong link between research and industry log parsing.