Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound, guarantees strict variance reduction over naive sample averaging, and admits asymptotic normality for principled uncertainty quantification. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K further demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes where conventional evaluation is highly unstable.
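To make the control-variate mechanism concrete, a textbook form of the adjustment is sketched below; this is a simplified illustration under our reading of the setup, not the paper's exact EIF-based one-step estimator, and the symbols $Y$, $W$, $n$, $N$, and $\hat\beta$ are introduced here for exposition only:
$$
\hat\theta_{\mathrm{cv}} \;=\; \bar Y_n \;-\; \hat\beta\,\bigl(\bar W_n - \bar W_N\bigr),
\qquad
\hat\beta \;=\; \frac{\widehat{\operatorname{Cov}}(Y, W)}{\widehat{\operatorname{Var}}(W)},
$$
where $\bar Y_n$ is the naive accuracy (sample average of labeled outcomes $Y$) on the $n$ labeled problems, $\bar W_n$ is the mean pairwise comparison signal $W$ on those same problems, and $\bar W_N$ is its mean over a larger auxiliary pool of judged reasoning chains. Whenever $Y$ and $W$ are correlated, the adjusted estimator has asymptotic variance no larger than that of $\bar Y_n$, which is the variance-reduction guarantee the EIF-based one-step estimator formalizes and sharpens.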