Logs are widely used in software system development and maintenance because of the rich runtime information they record. In recent years, growth in software size and complexity has led to a rapid increase in log volume. To handle these large volumes of logs efficiently and effectively, a line of research focuses on developing intelligent and automated log analysis techniques. However, only a few of these techniques have been successfully deployed in industry, owing to the lack of public log datasets and of open benchmarking on them. To fill this significant gap and facilitate more research on AI-driven log analytics, we have collected and released loghub, a large collection of system log datasets. In particular, loghub provides 19 real-world log datasets collected from a wide range of software systems, including distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. In this paper, we summarize the statistics of these datasets, introduce some practical usage scenarios of the loghub datasets, and present our benchmarking results on loghub to benefit researchers and practitioners in this field. As of this writing, the loghub datasets have been downloaded roughly 90,000 times in total by hundreds of organizations from both industry and academia. The loghub datasets are available at https://github.com/logpai/loghub.