Benchmarks are important tools for tracking progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset ($n{=}4181$) and a tagged, non-standard subset ($n{=}247$). Each problem was audited to ensure LaTeX compilability, solvability, and verifiability, which involved adding missing figures or information, labeling problems requiring a proof, estimation, or image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in $96.4\%$ of the judge disagreements, indicating its inability to differentiate between models' abilities, even well before saturation of the benchmark occurs. As problems become more challenging, we find that increasingly competent judges become essential to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the failure modes present in the tagged problem subset, demonstrating that dataset quality and judge reliability are both critical for developing accurate benchmarks of model performance.