Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics

Logs have been widely adopted in software system development and maintenance because of the rich runtime information they record. In recent years, the growth in software size and complexity has led to a rapid increase in log volume. To handle these large volumes of logs efficiently and effectively, a line of research focuses on developing intelligent and automated log analysis techniques. However, only a few of these techniques have been successfully deployed in industry, owing to the lack of public log datasets and of open benchmarks built on them. To fill this significant gap and facilitate more research on AI-driven log analytics, we have collected and released loghub, a large collection of system log datasets. In particular, loghub provides 19 real-world log datasets collected from a wide range of software systems, including distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. In this paper, we summarize the statistics of these datasets, introduce some practical usage scenarios for the loghub datasets, and present our benchmarking results on loghub for the benefit of researchers and practitioners in this field. At the time of writing, the loghub datasets have been downloaded roughly 90,000 times in total by hundreds of organizations from both industry and academia. The loghub datasets are available at https://github.com/logpai/loghub.

