Logs are crucial for analyzing large-scale software systems, offering insights into system health, performance, security threats, and potential bugs. However, their chaotic nature, characterized by sheer volume, lack of standards, and variability, makes manual analysis complex. Clustering algorithms can assist by grouping logs into a smaller set of templates, but they lose the temporal and relational context in doing so. Conversely, Large Language Models (LLMs) can provide meaningful explanations but struggle to process large collections efficiently. Moreover, representation techniques for both approaches are typically limited to either plain text or traditional charting, especially when dealing with large-scale systems. In this paper, we combine clustering and LLM summarization with event detection and Multidimensional Scaling, through the use of Time Curves, to produce a holistic pipeline that enables efficient and automatic summarization of vast collections of software system logs. The core of our approach is a proposed semimetric distance that effectively measures similarity between events, thus enabling a meaningful representation. We show that our method can explain the main events of logs collected from different applications without prior knowledge. We also show how the approach can be used to detect general trends as well as outliers in parallel and distributed systems by overlapping multiple projections. As a result, we expect a significant reduction in the time required to analyze and resolve system-wide issues, identify performance bottlenecks and security risks, and debug applications.
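The abstract does not define its semimetric, so the following is only an illustrative sketch of what a semimetric between events can look like, not the authors' distance. It assumes each event is a list of log-template identifiers produced by clustering, and it uses the squared Euclidean distance between template-frequency histograms as a stand-in: this is non-negative, symmetric, and zero only for identical histograms, but it can violate the triangle inequality. All function names and template identifiers are hypothetical.

```python
from collections import Counter


def template_histogram(template_ids):
    """Turn one event's template IDs into relative frequencies."""
    counts = Counter(template_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {template: count / total for template, count in counts.items()}


def event_distance(hist_a, hist_b):
    """Squared Euclidean distance between two histograms.

    Assumption: a stand-in semimetric, not the distance from the paper.
    """
    keys = set(hist_a) | set(hist_b)
    return sum((hist_a.get(k, 0.0) - hist_b.get(k, 0.0)) ** 2 for k in keys)


def distance_matrix(events):
    """Pairwise distances between events, in input order."""
    hists = [template_histogram(event) for event in events]
    return [[event_distance(p, q) for q in hists] for p in hists]


if __name__ == "__main__":
    # Hypothetical events: each is the sequence of templates seen in a window.
    events = [["T1", "T1"], ["T1", "T2"], ["T2", "T2"]]
    for row in distance_matrix(events):
        print(row)
```

With these three hypothetical events, the distance between the first and third (2.0) exceeds the sum of the two intermediate distances (0.5 + 0.5), which is exactly the triangle-inequality violation that separates a semimetric from a metric. A matrix of this kind is the sort of input a Multidimensional Scaling implementation that accepts precomputed dissimilarities could project into two dimensions.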