AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Drawing on safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.