AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or keep the severity of their errors bounded. Grounded in safety-critical engineering, we propose twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety, yielding a holistic performance profile. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.