AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Drawing on safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.