We uncover a systematic bias in the evaluation paradigm that adopts large language models~(LLMs), e.g., GPT-4, as referees to score the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily manipulated simply by altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna could beat ChatGPT on 66 out of 80 tested queries. To address this issue, we propose two simple yet effective calibration strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple detailed pieces of evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score. Extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. To facilitate future research on more robust large language model comparison, we integrate the techniques in the paper into an easy-to-use toolkit, \emph{FairEval}, along with the human annotations.\footnote{\url{https://github.com/i-Eval/FairEval}}
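As an illustration of Balanced Position Calibration, the following is a minimal Python sketch, not the FairEval implementation. It assumes a hypothetical \texttt{judge} callable that takes a question and two responses in presentation order and returns one score per position; every name in it is illustrative. Each response pair is scored in both orders and the scores are averaged per response. The toy judge is deliberately biased toward the first position so that the effect of the swap is visible.

```python
from statistics import mean
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption, not FairEval's API):
# (question, first_response, second_response) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def balanced_position_scores(
    judge: Judge, question: str, resp_a: str, resp_b: str, k: int = 1
) -> Tuple[float, float]:
    """Average the scores of A and B over both presentation orders.

    k is the number of judge calls per order; k > 1 only matters
    when the judge is stochastic (e.g., sampled LLM outputs).
    """
    a_scores, b_scores = [], []
    for _ in range(k):
        # Order 1: A is shown first, B second.
        s_a, s_b = judge(question, resp_a, resp_b)
        a_scores.append(s_a)
        b_scores.append(s_b)
        # Order 2: B is shown first, A second.
        s_b_swapped, s_a_swapped = judge(question, resp_b, resp_a)
        a_scores.append(s_a_swapped)
        b_scores.append(s_b_swapped)
    return mean(a_scores), mean(b_scores)


def verdict(score_a: float, score_b: float) -> str:
    """Turn a pair of scores into a win/tie/lose label."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def position_biased_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in for an LLM judge: scores by response length and
    adds a fixed bonus of 2 to whichever response is shown first."""
    return len(first) + 2.0, float(len(second))


if __name__ == "__main__":
    q, a, b = "q", "abc", "abcd"
    # A single order favors the response shown first.
    print(verdict(*position_biased_judge(q, a, b)))
    # Averaging over both orders cancels the positional bonus.
    print(verdict(*balanced_position_scores(position_biased_judge, q, a, b)))
```

With a real LLM judge, the fixed bonus would be replaced by whatever positional preference the model exhibits; the averaging step is the same. Multiple Evidence Calibration would correspond to setting \texttt{k} above 1 with a judge that samples a fresh evidence-then-score output on each call, under the same assumptions.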