Evaluating chat assistants based on large language models (LLMs) is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80\% agreement, the same level of agreement as between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show that our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. We will publicly release MT-bench questions, 3K expert votes, and 30K conversations with human preferences from Chatbot Arena.
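To make the agreement figure concrete, the following is a minimal sketch of how agreement between an LLM judge and human voters could be computed over pairwise comparisons. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `agreement_rate` and `swap_consistent_verdict`, the verdict labels `"A"`, `"B"`, and `"tie"`, and the toy votes are all hypothetical. The second helper shows one plausible way to reduce position bias, assumed here rather than taken from the abstract: judge each pair twice with the answer order swapped and keep a win only when both runs agree.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where both verdicts are identical."""
    if len(judge) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def swap_consistent_verdict(first_order: str, swapped_order: str) -> str:
    """Combine two judgments of the same pair made with the answer order swapped.

    Both verdicts are expressed in the original labels ("A", "B", or "tie").
    A winner is kept only if both runs agree; otherwise the result is a tie.
    """
    return first_order if first_order == swapped_order else "tie"


# Toy votes, invented for illustration only (not data from the paper).
judge_votes = ["A", "B", "A", "tie", "B"]
human_votes = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge_votes, human_votes)
print(f"agreement: {rate:.0%}")
```

Treating inconsistent swapped judgments as ties is a conservative choice: it trades some decisiveness for robustness to answer order, and other aggregation rules are possible.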