The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). Existing successful RL attempts on LLMs usually rely on thousands of high-quality training samples or more. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing a single training sample that elicits multidisciplinary impact. We present three key findings: (1) with RL, a single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) the math skills salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) an engineered synthetic sample that integrates multidisciplinary elements outperforms training on individual, naturally occurring samples. Our approach outperforms training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, which we dub sample engineering, toward precision engineering of training samples rather than simply increasing data volume.