Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer "How good is the model?" but "Can we trust the system to behave as intended, under change, at scale?" Yet most evaluation practices remain anchored in assumptions inherited from the model-centric era: static benchmarks, aggregate scores, and one-off success criteria. This paper argues that such approaches increasingly obscure rather than illuminate system behavior. We examine how evaluation pipelines themselves introduce silent failure modes, why high benchmark scores routinely mislead teams, and how agentic systems fundamentally alter the meaning of performance measurement. Rather than proposing new metrics or harder benchmarks, we aim to clarify the role of evaluation in the AI era, especially for agents: not as performance theater, but as a measurement discipline that underpins trust, iteration, and governance in non-deterministic systems.