Modern software systems produce vast amounts of logs, which are an essential resource for anomaly detection. Artificial Intelligence for IT Operations (AIOps) tools have been developed to automate log-based anomaly detection for software systems. Three practical challenges are widely recognized in this field: high data labeling costs, evolving logs in dynamic systems, and adaptability across different systems. In this paper, we propose CroSysLog, an AIOps tool for log-event level anomaly detection designed specifically to address these challenges. Following prior approaches, CroSysLog uses a neural representation approach to gain a nuanced understanding of logs and to generate a representation for each individual log event. CroSysLog is first trained on source systems that have sufficient labeled logs from open datasets, which gives it robustness, and it then adapts efficiently to a target system using only a few labeled log events from that system. We evaluate CroSysLog on open datasets from four large-scale distributed supercomputing systems: BGL, Thunderbird, Liberty, and Spirit. To train and evaluate CroSysLog, we use random log splits from these systems, each of which maintains the chronological order of its consecutive log events. These splits are widely distributed across the one- to two-year log collection period of each system, thereby capturing the evolving nature of the logs in each system. Our results show that, after being trained on Liberty and BGL as source systems, CroSysLog adapts efficiently to the target systems Thunderbird and Spirit using a few labeled log events from each, and performs anomaly detection effectively on both. The results demonstrate that CroSysLog is a practical, scalable, and adaptable tool for log-event level anomaly detection in the operation and maintenance of software systems.
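The abstract describes random log splits that keep consecutive log events in chronological order. One way such splits could be drawn is sketched below. This is a minimal illustration under stated assumptions, not the procedure used for CroSysLog: the function name `sample_chronological_splits`, the fixed `split_len`, and the choice of non-overlapping blocks are all hypothetical, since the abstract does not specify how the splits are sampled.

```python
import random


def sample_chronological_splits(events, num_splits, split_len, seed=0):
    """Draw random, non-overlapping contiguous splits from an ordered log.

    Assumption (not from the paper): the log is cut into fixed-size blocks
    and a random subset of blocks is selected. Events inside each split keep
    their original order, and the splits are returned in chronological order.
    """
    rng = random.Random(seed)
    num_blocks = len(events) // split_len
    if num_splits > num_blocks:
        raise ValueError("not enough log events for the requested splits")
    # Choose which blocks to keep, then sort so earlier splits come first.
    block_ids = sorted(rng.sample(range(num_blocks), num_splits))
    return [events[b * split_len:(b + 1) * split_len] for b in block_ids]


# Usage with placeholder events; integers stand in for timestamped log lines.
log_events = list(range(1000))
splits = sample_chronological_splits(log_events, num_splits=5, split_len=50)
```

Because the selected blocks are scattered across the whole log, the splits cover different periods of the collection window, which is the property the abstract relies on to reflect how logs evolve over time.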