Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating their alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the well-established statistical methods we use to evaluate and rank models efficiently and accurately. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes agree well with those of expert raters. Together, these analyses establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at \url{https://chat.lmsys.org}.