Failure-aware observability diagnoses wasted computation in multi-agent LLM systems before final-answer evaluation can explain what went wrong. We propose a trace-based framework for a three-agent architecture (orchestrator, search agent, and execution agent) that converts structured events into online signals for loops, budget pressure, low information gain, and tool instability, then adds offline semantic grounding metrics and selective LLM-as-judge evaluation. On 165 GAIA validation traces run under identical budget caps, 98 runs produce usable final answers and 67 fail or stop without one. Among failed runs that triggered a warning, an average of 58.1% of tokens are spent after the first warning, indicating substantial opportunity for intervention. A 10-task Level-2 pilot that uses warnings to diversify search or require evidence reduces the post-warning token fraction from 0.638 in the baseline to 0.304. The results support a layered design: cheap online signals help the orchestrator redirect or halt redundant behavior, while deeper semantic checks identify whether completed answers are grounded enough to trust.
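To make the post-warning token fraction concrete, the following is a minimal sketch of how such a metric could be computed from one run's trace. The `TraceEvent` type, its fields, and the choice to count only tokens from events strictly after the first warning are assumptions made for illustration; the abstract does not specify the trace schema or the exact counting rule.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TraceEvent:
    """Hypothetical trace record: tokens consumed by one agent step."""
    tokens: int
    warning: bool = False  # True if an online signal fired at this step


def post_warning_token_fraction(events: Sequence[TraceEvent]) -> Optional[float]:
    """Share of a run's tokens spent after its first warning.

    Returns None when the run never warned or consumed no tokens.
    Assumption: the warning step itself is not counted as post-warning.
    """
    total = sum(e.tokens for e in events)
    first = next((i for i, e in enumerate(events) if e.warning), None)
    if first is None or total == 0:
        return None
    after = sum(e.tokens for e in events[first + 1:])
    return after / total


# Illustrative run (made-up numbers): the warning fires at the second step.
run = [
    TraceEvent(tokens=100),
    TraceEvent(tokens=50, warning=True),
    TraceEvent(tokens=200),
    TraceEvent(tokens=150),
]
print(post_warning_token_fraction(run))
```

Averaging this value over the failed runs that triggered a warning would give a figure comparable in form to the 58.1% reported above, though the actual aggregation used in the study may differ.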