Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6 of 9 under real subject-level clustering and stays at 5 to 6 of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests to obtain the required sample size N*, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern persists under multiplicity correction and anytime-valid sequential testing.
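The resolution ratio q = N/N* can be illustrated with a minimal sketch. It assumes the classical normal-approximation sample size for a two-sided McNemar-type paired test (Connor, 1987), which is a stand-in rather than the exact inversion or the second-order expansion described above, and it ignores clustering. The function names `paired_n_star` and `resolution_ratio`, and the discordant-rate inputs `p10` and `p01`, are hypothetical names introduced for illustration.

```python
from math import inf, sqrt
from statistics import NormalDist


def paired_n_star(p10: float, p01: float,
                  alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate number of paired items N* for a two-sided McNemar-type test.

    p10: fraction of items model A answers correctly and model B does not.
    p01: fraction of items model B answers correctly and model A does not.
    Assumption: Connor (1987) normal approximation, not the source's method.
    """
    pi_d = p10 + p01          # total discordant rate
    delta = p10 - p01         # paired accuracy difference
    if delta == 0:
        return inf            # no effect to detect at any sample size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(pi_d) + z_beta * sqrt(pi_d - delta ** 2)) ** 2
    return numerator / delta ** 2


def resolution_ratio(n_items: int, p10: float, p01: float,
                     alpha: float = 0.05, power: float = 0.8) -> float:
    """q = N / N*; q < 1 means the pair is unresolved at (alpha, power)."""
    return n_items / paired_n_star(p10, p01, alpha, power)


if __name__ == "__main__":
    # Illustrative inputs only: a 2-point accuracy gap with 10% discordant items.
    n_star = paired_n_star(0.06, 0.04)
    q = resolution_ratio(1000, 0.06, 0.04)
    print(f"N* ~ {n_star:.0f}, q = {q:.2f}")
```

Under this reading, a pair with q below 1 has fewer benchmark items than the test needs to reach the stated power, so its displayed ordering counts as unresolved.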