From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where in a failed trajectory the responsible evidence lies and which harness layer causes the unreliable behavior, so the resulting changes are broad, indirect, or poorly scoped. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to the responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications, so that each patch reduces its target flaw without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2% to 50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across the seven ETCLOVG harness layers (execution, tool, context, lifecycle, observability, verification, and governance).

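To make the consolidation step concrete, the following is a minimal sketch of how per-trajectory diagnoses might be grouped into flaw records and mapped to scoped repair operators. It is an illustration under stated assumptions, not HarnessFix's implementation: the names `Diagnosis`, `FlawRecord`, `consolidate`, `repair_operator`, the symptom labels, the operator names, and the `min_support` threshold are all hypothetical, and the sketch does not model HTIR itself.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    """The seven harness concerns named in the abstract (E, T, C, L, O, V, G)."""
    EXECUTION = "E"
    TOOL = "T"
    CONTEXT = "C"
    LIFECYCLE = "L"
    OBSERVABILITY = "O"
    VERIFICATION = "V"
    GOVERNANCE = "G"


@dataclass(frozen=True)
class Diagnosis:
    """One failure attributed to a trajectory step and a harness layer."""
    task_id: str
    step_id: int
    layer: Layer
    symptom: str  # hypothetical label, e.g. "truncated_tool_output"


@dataclass
class FlawRecord:
    """A recurring diagnosis, with the trajectory steps that support it."""
    layer: Layer
    symptom: str
    evidence: list = field(default_factory=list)  # (task_id, step_id) pairs

    @property
    def support(self) -> int:
        # Count distinct failed tasks, not individual steps.
        return len({task_id for task_id, _ in self.evidence})


# Hypothetical operator table: each layer maps to one narrowly scoped edit.
REPAIR_OPERATORS = {
    Layer.TOOL: "patch_tool_wrapper",
    Layer.CONTEXT: "revise_context_builder",
    Layer.VERIFICATION: "add_verification_check",
}


def consolidate(diagnoses, min_support=2):
    """Group diagnoses by (layer, symptom) and keep only recurring ones."""
    groups = defaultdict(list)
    for d in diagnoses:
        groups[(d.layer, d.symptom)].append((d.task_id, d.step_id))
    records = [FlawRecord(layer, symptom, evidence)
               for (layer, symptom), evidence in groups.items()]
    recurring = [r for r in records if r.support >= min_support]
    return sorted(recurring, key=lambda r: r.support, reverse=True)


def repair_operator(record):
    """Look up the scoped operator; fall back to manual review if unmapped."""
    return REPAIR_OPERATORS.get(record.layer, "manual_review")
```

Keying records on the pair (layer, symptom) is one way to keep each repair scoped: a flaw seen in only one trajectory is filtered out by `min_support`, and a layer with no mapped operator falls back to review instead of triggering a broad harness change.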