We propose a framework for robust evaluation of the reasoning capabilities of language models, using functional variants of benchmarks. Models that solve a reasoning test should exhibit no difference in performance between the static version of a problem and a snapshot of its functional variant. We have rewritten the relevant fragment of the MATH benchmark into its functional variant, MATH(), with functionalization of other benchmarks to follow. When evaluating current state-of-the-art models over snapshots of MATH(), we find a reasoning gap, defined as the percentage difference between the static and functional accuracies. Reasoning gaps range from 58.35% to 80.31% among the state-of-the-art closed and open-weights models that perform well on static benchmarks, with the caveat that the gaps are likely to be smaller with more sophisticated prompting strategies. We show that models which anecdotally have good reasoning performance on real-world tasks have quantifiably lower gaps, motivating the open problem of building "gap 0" models. Code for evaluation and the new evaluation datasets, three MATH() snapshots, are publicly available at https://github.com/consequentai/fneval/.
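To make the terminology concrete, the following is a minimal sketch of a static problem, a functional variant that yields a fresh snapshot per seed, and a gap computation. Everything here is illustrative and assumed: the names `Problem`, `static_problem`, `functional_problem`, `accuracy`, and `reasoning_gap` are hypothetical and are not taken from the fneval repository, the example question is invented, and the gap is read as the relative drop from static to functional accuracy, which is one plausible interpretation of "percentage difference".

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    question: str
    answer: int


def static_problem() -> Problem:
    # A fixed benchmark item: the same text and answer every time.
    return Problem("What is the sum of the first 10 positive integers?", 55)


def functional_problem(seed: int) -> Problem:
    # Hypothetical functional variant: same reasoning, different constants.
    # Each seed deterministically produces one snapshot of the problem.
    rng = random.Random(seed)
    n = rng.randint(5, 50)
    return Problem(
        f"What is the sum of the first {n} positive integers?",
        n * (n + 1) // 2,
    )


def accuracy(results: list[bool]) -> float:
    # Percentage of problems answered correctly.
    if not results:
        raise ValueError("results must be non-empty")
    return 100.0 * sum(results) / len(results)


def reasoning_gap(static_acc: float, functional_acc: float) -> float:
    # Assumption: the gap is the relative drop from static to functional
    # accuracy, expressed as a percentage of the static accuracy.
    if static_acc <= 0:
        raise ValueError("static accuracy must be positive")
    return 100.0 * (static_acc - functional_acc) / static_acc
```

Under this reading, a model scoring 50% on the static set and 20% on the snapshots would have `reasoning_gap(50.0, 20.0) == 60.0`, and a "gap 0" model would score identically on both.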