As Large Language Models (LLMs) achieve breakthroughs in complex reasoning, Codeforces-based Elo ratings have emerged as a prominent metric for evaluating competitive-programming capability. However, these ratings are often reported without critical experimental details, leading to significant discrepancies: recent reports show the score of the same model version fluctuating by nearly 500 points. This paper presents a systematic empirical study of three hidden factors that bias Elo evaluations: (1) the temporal ordering of submissions, (2) contest difficulty selection, and (3) run-to-run stochastic variability of LLMs. Using a controlled benchmark of 37 recent Codeforces contests and 13,691 generated test cases, we demonstrate that Elo scores are highly sensitive to these parameters. Varying the submission order alone can shift scores by 394 points, while contest selection can cause differences of up to 1,122 points for the same model. Run-to-run performance is also substantially unstable, with a maximum difference of 349 points in mean scores observed when evaluating identical contests. We conclude that direct Elo comparisons are unreliable and potentially misleading without strict standardization and transparent reporting of experimental settings.