AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. In particular, such a metric ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we propose twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Together, these metrics provide a holistic performance profile. Evaluating 15 models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.