While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis}|\text{background})$ ($P(h|b)$), unexplored. We demonstrate that directly training $P(h|b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework enabling tractable training and scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To support this training, we release TOMATO-Star, a dataset of 108,717 decomposed papers constructed using 38,400 GPU hours. Furthermore, we show that while brute-force sampling hits a "complexity wall," MOOSE-Star exhibits continuous test-time scaling.