Recent progress in large language model (LLM) reasoning has focused on domains such as mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in scientific domains such as medicine and materials science remains limited due to sparse dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further fine-tune models on these data with reinforcement learning and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci, available at https://huggingface.co/datasets/JustinTX/WildSci, to enable scalable and sustainable research in scientific reasoning.