Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which together occur in about 30% of all agent trajectories. To address these issues, we developed Wink, a lightweight, asynchronous self-intervention system. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back onto a productive path. We evaluated our system on over 10,000 real-world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system yields statistically significant reductions in Tool Call Failures, Tokens per Session, and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale.