We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned through supervised fine-tuning (SFT) on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which together instill deliberate scientific reasoning. It supports five capability families covering up to 103 tasks across scientific workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, and (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail the data curation and training procedures and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.