Execution traces are a critical source of information for understanding, debugging, and optimizing complex software systems. However, traces from OS kernels or large-scale applications like Chrome or MySQL are massive and difficult to analyze. Existing tools rely on predefined analyses, and custom insights often require writing domain-specific scripts, which is an error-prone and time-consuming task. This paper introduces TAAF (Trace Abstraction and Analysis Framework), a novel approach that combines time-indexing, knowledge graphs (KGs), and large language models (LLMs) to transform raw trace data into actionable insights. TAAF constructs a time-indexed KG from trace events to capture relationships among entities such as threads, CPUs, and system resources. An LLM then interprets query-specific subgraphs to answer natural-language questions, reducing the need for manual inspection and deep system expertise. To evaluate TAAF, we introduce TraceQA-100, a benchmark of 100 questions grounded in real kernel traces. Experiments across three LLMs and multiple temporal settings show that TAAF improves answer accuracy by up to 31.2%, particularly in multi-hop and causal reasoning tasks. We further analyze where graph-grounded reasoning helps and where limitations remain, offering a foundation for next-generation trace analysis tools.
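The pipeline described above (trace events, then a time-indexed KG, then a query-specific subgraph handed to an LLM) can be illustrated with a minimal sketch. This is not TAAF's implementation: the `TraceEvent` fields, the `TimeIndexedKG` class, the relation names, the fixed-width time buckets, and the `to_prompt` helper are all hypothetical choices made for illustration, and the LLM call itself is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical event record; field names are illustrative, not TAAF's schema.
@dataclass(frozen=True)
class TraceEvent:
    ts_ns: int          # timestamp in nanoseconds
    name: str           # event type, e.g. "sched_switch"
    cpu: int            # CPU on which the event was recorded
    tid: int            # thread the event is attributed to
    resource: str = ""  # optional resource touched (file, lock, ...)


class TimeIndexedKG:
    """Stores (subject, relation, object, timestamp) edges bucketed by time."""

    def __init__(self, bucket_ns: int = 1_000_000):
        self.bucket_ns = bucket_ns
        self.buckets = defaultdict(list)  # bucket index -> list of edges

    def add_event(self, ev: TraceEvent) -> None:
        bucket = ev.ts_ns // self.bucket_ns
        thread, cpu = f"thread:{ev.tid}", f"cpu:{ev.cpu}"
        # Every event links its thread to the CPU it was observed on.
        self.buckets[bucket].append((thread, "runs_on", cpu, ev.ts_ns))
        # Events that touch a resource add a second, event-named edge.
        if ev.resource:
            target = f"resource:{ev.resource}"
            self.buckets[bucket].append((thread, ev.name, target, ev.ts_ns))

    def subgraph(self, start_ns: int, end_ns: int) -> list:
        """Return edges with start_ns <= ts < end_ns, sorted by timestamp."""
        first = start_ns // self.bucket_ns
        last = (end_ns - 1) // self.bucket_ns
        edges = [
            edge
            for b in range(first, last + 1)
            for edge in self.buckets.get(b, [])
            if start_ns <= edge[3] < end_ns
        ]
        return sorted(edges, key=lambda edge: edge[3])


def to_prompt(edges, question: str) -> str:
    """Serialize a subgraph as timestamped triples followed by the question."""
    facts = "\n".join(f"[{ts} ns] {s} --{r}--> {o}" for s, r, o, ts in edges)
    return f"Trace facts:\n{facts}\n\nQuestion: {question}"


# Made-up events, used only to exercise the sketch.
events = [
    TraceEvent(100_000, "sched_switch", cpu=0, tid=42),
    TraceEvent(1_500_000, "sys_read", cpu=1, tid=42, resource="/var/log/app.log"),
    TraceEvent(2_200_000, "sched_switch", cpu=1, tid=7),
]

kg = TimeIndexedKG()
for ev in events:
    kg.add_event(ev)

# Restrict the graph to the window the question is about, then build the prompt.
window = kg.subgraph(1_000_000, 2_000_000)
prompt = to_prompt(window, "Which thread read a file in this window?")
print(prompt)
```

Bucketing edges by timestamp means a question about a given interval only needs to scan the buckets that overlap it, so the text passed to the model stays small even when the full trace is large.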