While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis} \mid \text{background})$ ($P(h \mid b)$), unexplored. We demonstrate that directly training $P(h \mid b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework enabling tractable training and scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a dataset of 108,717 decomposed papers (38,400 GPU hours) for training. Furthermore, we show that while brute-force sampling hits a ``complexity wall,'' MOOSE-Star exhibits continuous test-time scaling.
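To give a rough sense of the gap between $O(N^k)$ and $O(\log N)$, the following sketch compares the number of candidates a brute-force search would face against the number of comparisons in a simple tree-structured search. It is an illustration under stated assumptions, not the MOOSE-Star algorithm: the balanced tree, the branching factor `branching`, the cost model of one comparison per child per level, and all function names are hypothetical choices made for this example.

```python
def brute_force_candidates(n: int, k: int) -> int:
    """Number of ordered k-tuples of inspirations drawn from n items: n**k."""
    return n ** k


def tree_depth(n: int, branching: int) -> int:
    """Smallest depth d such that branching**d >= n (integer arithmetic only)."""
    if branching < 2:
        raise ValueError("branching must be at least 2")
    depth, capacity = 0, 1
    while capacity < n:
        capacity *= branching
        depth += 1
    return depth


def hierarchical_comparisons(n: int, k: int, branching: int = 2) -> int:
    """Comparisons for k sequential retrievals over a balanced tree.

    Assumed cost model (hypothetical): each retrieval descends the tree once
    and scores every child of the current node at each level.
    """
    return k * branching * tree_depth(n, branching)


if __name__ == "__main__":
    k = 3  # assumed number of inspirations composed into one hypothesis
    for n in (1_000, 100_000, 10_000_000):
        print(
            f"N={n:>10,}  brute force={brute_force_candidates(n, k):.3e}  "
            f"hierarchical={hierarchical_comparisons(n, k)}"
        )
```

Under these assumptions the brute-force count grows as $N^k$, while the tree-based count grows only with the depth of the tree, which is logarithmic in $N$ for a fixed $k$ and branching factor.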