Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discuss practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.