The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to its effectiveness, its ability to generalize across tasks, and its interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-judges and introducing their functionality (Why use LLM judges?). We then address the methodology for constructing an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights into the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.