MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model

With the advent of DeepSeek-R1, a new wave of reinforcement learning (RL) methods has emerged that seem to unlock stronger mathematical reasoning. However, a closer look at the open-source ecosystem reveals a critical limitation: with sufficiently many draws (e.g., $\texttt{pass@1024}$), many existing base models already solve nearly all questions on widely used math benchmarks such as MATH-500 and AIME 2024. This suggests that the RL fine-tuning methods prevalent in the LLM reasoning literature largely sharpen existing solution modes rather than discover entirely new ones. Such sharpening stands in contrast to the broader promise of RL: to foster exploration and to acquire new skills. To move beyond this plateau, we introduce MATH-Beyond (MATH-B), a benchmark deliberately constructed to defeat common open-source models of up to 8B parameters even under large sampling budgets. Improving performance on our benchmark via RL requires methods that learn to reason in ways that go beyond what the base model can achieve under repeated sampling. Since the problems are drawn from subsets of the DAPO-Math-17K and DeepScaleR datasets, they remain topically equivalent to standard high-school math. Validating our premise, RL fine-tuned models such as Nemotron-Research-Reasoning-Qwen-1.5B and DeepScaleR-1.5B-Preview perform poorly on MATH-B at $\texttt{pass@1024}$, showing how existing approaches fall short in tackling harder instances. We hope MATH-B will catalyze exploration-driven RL approaches that elicit deeper reasoning capabilities. We release MATH-B at https://huggingface.co/datasets/brendel-group/MATH-Beyond.
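The abstract leans on $\texttt{pass@k}$ without saying how it is computed. A minimal sketch follows, assuming the unbiased estimator commonly used in the code-generation evaluation literature, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and are not taken from the MATH-B release.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct.

    Assumption: this is the commonly used unbiased estimator,
    1 - C(n - c, k) / C(n, k); the source does not state its formula.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark (hypothetical helper)."""
    if not correct_counts:
        raise ValueError("correct_counts must be non-empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With $n = k = 1024$ the estimate for a problem is 1 as soon as a single draw is correct and 0 otherwise, so under this estimator a problem counts as unsolved at $\texttt{pass@1024}$ only when all 1024 draws fail.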

