The ``LLM-as-a-judge'' paradigm has become a standard method for evaluating open-ended generation. To avoid the quadratic cost of all-pairs comparisons, popular benchmarks such as Arena-Hard and AlpacaEval compare every model against a single anchor. Despite its widespread use, however, the impact of anchor selection on the reliability of the resulting rankings remains largely unexplored. In this work, we systematically investigate anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. In particular, commonly chosen anchors, namely the best- and worst-performing models, make poor anchors: because these extreme models are consistently better or worse than every other model, comparisons against them carry little information about the models' relative ranking. We further quantify the effect size of anchor selection, showing that it is comparable to that of judge-model selection. We conclude with actionable recommendations. First, we conduct a power analysis to compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to reliably distinguish competitive models. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
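As an illustration of the kind of power analysis referred to above, the sketch below estimates how many benchmark prompts are needed to distinguish two models by their win rates against a shared anchor. It is a minimal approximation, assuming each comparison is an independent Bernoulli outcome and using a standard two-proportion z-test; the helper name and the example win rates (0.55 vs.\ 0.50) are illustrative assumptions, not figures from this paper.

\begin{verbatim}
# Minimal power-analysis sketch for anchor-based evaluation (illustrative only).
# Assumption: each model's benchmark score is its win rate against a shared
# anchor, and two models are compared with a two-proportion z-test.
from scipy.stats import norm

def required_prompts(p1: float, p2: float,
                     alpha: float = 0.05, power: float = 0.8) -> int:
    """Prompts needed per model to detect a win-rate gap of p1 vs p2."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return int(n) + 1

# Two competitive models: 55% vs 50% win rate against the same anchor.
print(required_prompts(0.55, 0.50))  # about 1562 prompts under this approximation
\end{verbatim}

Under these assumptions, separating a 5-point win-rate gap already requires on the order of 1,500 prompts per model, which is consistent with the observation that standard benchmark sizes are insufficient for reliably distinguishing competitive models.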