Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.

翻译：软件工程智能体正越来越多地部署在可评估的工程环境中，然而故障后的恢复过程仍成本高昂、依赖人工且缺乏系统性。现有系统虽能暴露运行时迹或生成后续反馈，但未能将异构的运行时证据转化为有根基、有边界的恢复指导，以支持后续尝试。我们提出PROBE——一种面向软件工程智能体的故障锚定式结构化恢复框架。PROBE通过遥测层、诊断层和导控门，将故障运行遥测数据组织为结构化证据、结构化诊断以及有边界的恢复指导。遥测层保留细粒度运行时信号；诊断层融合跨信号证据以形成有根基的诊断；导控门则仅在证据充分、可操作且属于智能体侧行为范畴时，才生成基于诊断的指导。我们在三类场景中评估PROBE：仓库级软件修复、企业工作流恢复及AIOps服务缓解。在257个初始未解决案例中，PROBE达到65.37%的Top-1诊断准确率与21.79%的恢复率，分别较最强非PROBE基线提升43.58和12.45个百分点。结果揭示了诊断-恢复鸿沟：准确的诊断虽必要但不足，除非将其转化为后续尝试可执行与可验证的有边界指导。除受控评估外，Microsoft IcM原型表明，PROBE可作为非侵入式侧信道附着于现有服务诊断工作流，无需改变智能体策略、工具集或执行预算。这些结果表明，基于遥测、锚定故障的恢复方法可在现实工程约束下提升故障后的可恢复性。