In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the "battle" a draw, resulting in an adjustment to the ratings of both models. The prevailing approach to modeling these rating dynamics is to treat battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more often for queries rated as very easy and for those rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
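To make the draw-ignoring modification concrete, the sketch below shows a standard Elo update that simply skips rating changes on draws, rather than pulling the two ratings together. This is a minimal illustration, not the paper's implementation: the K-factor of 4 and the function names are assumptions for the example.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a: float, r_b: float, outcome: str, k: float = 4.0) -> tuple[float, float]:
    """One battle's rating update; k=4.0 is an illustrative choice.

    outcome: "a_wins", "b_wins", or "draw".
    Under the draw-ignoring variant studied here, a draw leaves both
    ratings unchanged instead of nudging them toward each other.
    """
    if outcome == "draw":
        return r_a, r_b  # skip the update entirely
    s_a = 1.0 if outcome == "a_wins" else 0.0
    e_a = expected_score(r_a, r_b)
    return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)
```

In the classical Elo treatment, a draw is scored as 0.5 for both players, which shifts the higher-rated model's rating down and the lower-rated model's rating up; the variant above removes that equalizing pull while leaving win/loss updates untouched.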