The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL approaches for LLMs usually relies on large volumes of high-quality samples. In this paper, we challenge conventional assumptions about data requirements in RL for LLMs by demonstrating the effectiveness of one-shot reinforcement learning. Specifically, we introduce polymath learning, a framework for designing a single training sample that elicits multidisciplinary reasoning improvement. We present three key findings: (1) a single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) analysis of salient mathematical skills provides insight into the characteristics associated with effective polymath samples; and (3) an engineered synthetic sample that integrates multidisciplinary elements and broader skill coverage achieves stronger performance than naturally occurring individual samples. Across various reasoning benchmarks, polymath learning achieves stronger performance than training on larger datasets, demonstrating that the reasoning structure and skills in samples, rather than their quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, which we call sample engineering, toward precision engineering of samples that complements simply increasing data volume.