Recent progress in large language model (LLM) reasoning has focused on domains such as mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in scientific domains such as medicine and materials science remains limited due to sparse dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further fine-tune models on these data with reinforcement learning and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci, available at https://huggingface.co/datasets/JustinTX/WildSci, to enable scalable and sustainable research in scientific reasoning.