Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation

In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user either chooses the winning response or deems the "battle" a draw; the ratings of both models are then adjusted accordingly. The prevailing approach to modeling these rating dynamics is to treat battles as two-player game matches, as in chess, and to apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal, and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (where the outcomes to be predicted include draws) for all four rating systems studied. Further analyses suggest that draws occur more often for queries rated as very easy and for queries rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
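As a rough illustration of the idea, the sketch below runs a plain Elo update over a list of battles and optionally skips the update when the outcome is a draw. It also includes a small helper for a risk ratio of the kind reported above. This is only a minimal sketch under stated assumptions: the abstract does not name the four rating systems or their hyperparameters, so the use of standard Elo, the K-factor of 32, the initial rating of 1000, and all function and parameter names (`run_elo`, `skip_draws`, `risk_ratio`) are hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def run_elo(battles, k=32.0, initial=1000.0, skip_draws=False):
    """Apply Elo updates over battles given as (model_a, model_b, outcome).

    outcome is "a" (A wins), "b" (B wins), or "draw".
    With skip_draws=True, draws leave both ratings unchanged
    (an assumed reading of "ignoring rating updates for draws").
    Models that only ever appear in skipped draws are not in the result.
    """
    ratings = defaultdict(lambda: initial)
    score_for_a = {"a": 1.0, "b": 0.0, "draw": 0.5}
    for model_a, model_b, outcome in battles:
        if skip_draws and outcome == "draw":
            continue  # no rating change for either model
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = expected_score(r_a, r_b)
        s_a = score_for_a[outcome]
        # Zero-sum update: B's change is the negative of A's change.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b - k * (s_a - e_a)
    return dict(ratings)


def risk_ratio(draws_exposed, total_exposed, draws_unexposed, total_unexposed):
    """Draw rate in the exposed group divided by the draw rate in the rest."""
    return (draws_exposed / total_exposed) / (draws_unexposed / total_unexposed)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    battles = [("x", "y", "a"), ("x", "y", "draw")]
    print(run_elo(battles))                   # draw pulls the ratings together
    print(run_elo(battles, skip_draws=True))  # draw leaves the ratings as-is
```

In the standard variant, a draw between unequally rated models moves their ratings toward each other, which is the "equalizing" behavior the abstract questions; with `skip_draws=True` the gap established by decisive battles is left untouched.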

