Agentic systems have been widely studied as a way to automate software engineering tasks such as bug fixing. As these systems take on increasingly complex tasks, understanding where and why they fail becomes essential for iterative refinement and operational reliability. Existing automated failure diagnosis approaches rely on task execution trajectories, yet their effectiveness degrades substantially as trajectory length and complexity increase. Trajectories from repository-level coding tasks pose two challenges in particular. First, they are laden with noise, such as redundant program structure and verbose code context. Second, they are very long, while long-context reasoning remains a known weakness of large language models (LLMs). To address these two challenges, we propose TrajAudit, the first failure diagnosis framework for repository-level coding trajectories. TrajAudit employs an investigator agent supported by two modules that help it handle noisy long contexts: one filters failure-irrelevant information through pattern matching and keyword detection, and the other generates a preliminary diagnosis from test failure reports, which the agent uses as prior knowledge. The investigator agent can also invoke tools to retrieve filtered-out content on demand, so that critical information is preserved while noise is minimized. We also introduce RootSE, a benchmark of 93 real-world agentic failure instances sourced from software maintenance tasks, which is the most complex trajectory diagnosis benchmark to date. Experiments on RootSE show that TrajAudit outperforms all existing baselines by over 24.4 percentage points in localization accuracy while reducing token consumption by at least 18%, demonstrating its practical effectiveness. We hope this work draws community attention to failure management in agentic software engineering and provides a foundational resource for future research.
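To make the filtering idea concrete, the following is a minimal sketch of how pattern matching and keyword detection could replace failure-irrelevant trajectory lines with placeholders while keeping them retrievable on demand. It is an illustration under stated assumptions, not TrajAudit's implementation: the names `NOISE_PATTERNS`, `FAILURE_KEYWORDS`, `FilteredTrajectory`, and `filter_trajectory` are hypothetical, and the abstract does not specify the actual patterns, keywords, or tool interface.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical noise patterns; the abstract does not list the real ones.
NOISE_PATTERNS = [
    re.compile(r"^\s*(import|from)\s+\w+"),  # import statements
    re.compile(r"^\s*$"),                    # blank lines
    re.compile(r"^\s*#"),                    # comment-only lines
]

# Hypothetical keywords that mark a line as potentially failure-relevant.
FAILURE_KEYWORDS = ("error", "traceback", "failed", "exception", "assert")


@dataclass
class FilteredTrajectory:
    """Condensed trajectory plus a stash of the lines that were filtered out."""
    kept: List[str] = field(default_factory=list)
    stash: Dict[int, str] = field(default_factory=dict)

    def retrieve(self, ref: int) -> str:
        """Stand-in for the on-demand retrieval tool: return a filtered line."""
        return self.stash[ref]


def filter_trajectory(lines: List[str]) -> FilteredTrajectory:
    """Replace noisy lines with placeholders unless they mention a failure keyword."""
    result = FilteredTrajectory()
    for index, line in enumerate(lines):
        lowered = line.lower()
        has_keyword = any(keyword in lowered for keyword in FAILURE_KEYWORDS)
        is_noise = any(pattern.search(line) for pattern in NOISE_PATTERNS)
        if is_noise and not has_keyword:
            # Keep the original text retrievable by its line index.
            result.stash[index] = line
            result.kept.append(f"[filtered:{index}]")
        else:
            result.kept.append(line)
    return result
```

In this sketch the placeholder `[filtered:N]` carries the index that an agent would pass to `retrieve`, so nothing is discarded outright; the filtered view is shorter, and the original text remains available when the agent decides it needs it.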