Multi-agent systems (MAS) are increasingly capable of tackling complex real-world tasks, yet their reliance on inter-agent coordination, tool use, and long-horizon reasoning makes error recognition particularly challenging. Minor errors can propagate across agents and escalate into task failures, producing long, intertwined execution trajectories that are costly for both human developers and automated systems to debug and analyze. Our key insight is that, despite surface differences in failure trajectories (e.g., logs), MAS errors often recur with similar structural patterns. This paper presents CORRECT, the first lightweight, training-free framework that leverages an online cache of distilled error schemata to recognize failure structures and transfer that knowledge to new requests. This cache-based reuse allows LLMs to perform targeted error localization at inference time, avoiding expensive retraining while adapting to dynamic MAS deployments in under a second. To support rigorous study in this domain, we also introduce CORRECT-Error, a large-scale dataset of over 2,000 annotated trajectories collected through a novel error-injection pipeline guided by real-world error distributions and further validated through human evaluation to ensure alignment with natural failure patterns. Experiments across seven diverse MAS applications show that CORRECT improves step-level error localization by up to 19.8% over existing methods at near-zero overhead, substantially narrowing the gap between automated and human-level error recognition.
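To make the cache-based reuse idea concrete, the following is a minimal sketch and not the CORRECT implementation. The abstract does not say how schemata are represented, matched, or presented to the LLM, so everything here is an assumption made for illustration: the `ErrorSchema` and `SchemaCache` names, the keyword-set signature, the Jaccard-overlap lookup, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ErrorSchema:
    """Hypothetical distilled description of a recurring failure structure."""
    name: str
    signature: frozenset  # assumed representation: keywords typical of the pattern
    hint: str             # guidance passed to the LLM during localization


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class SchemaCache:
    """Online cache: schemata are appended as failures are diagnosed."""
    schemata: list = field(default_factory=list)

    def add(self, schema):
        self.schemata.append(schema)

    def lookup(self, trajectory, k=1):
        """Return the k cached schemata that best overlap the trajectory tokens."""
        tokens = frozenset(
            word.lower() for step in trajectory for word in step.split()
        )
        ranked = sorted(
            self.schemata,
            key=lambda s: jaccard(s.signature, tokens),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(trajectory, schemata):
    """Assemble a localization prompt that includes the retrieved schemata."""
    steps = "\n".join(f"Step {i}: {s}" for i, s in enumerate(trajectory))
    hints = "\n".join(f"- {s.name}: {s.hint}" for s in schemata)
    return (
        f"Known failure patterns:\n{hints}\n\n"
        f"Trajectory:\n{steps}\n\n"
        "Which step introduced the error?"
    )


if __name__ == "__main__":
    cache = SchemaCache()
    cache.add(ErrorSchema(
        "tool_arg_mismatch",
        frozenset({"tool", "call", "invalid", "argument"}),
        "Check the first tool call whose arguments contradict the plan.",
    ))
    trajectory = [
        "planner: draft plan",
        "coder: tool call search(query=None)",
        "coder: invalid argument error",
    ]
    print(build_prompt(trajectory, cache.lookup(trajectory)))
```

Because the lookup in this sketch is a pure in-memory comparison, adding a schema requires no retraining. A real system would likely need a richer structural representation than keyword sets.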