LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessment. However, as evaluands grow increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge is constrained by inherent biases, shallow single-pass reasoning, and an inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework for navigating this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify the key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey their applications across both general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.