While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis}|\text{background})$ ($P(h|b)$), unexplored. We demonstrate that directly training $P(h|b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework that enables tractable and scalable training of $P(h|b)$, while supporting more scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a training dataset of 108,717 decomposed papers whose construction consumed 38,400 GPU hours. Empirically, MOOSE-Star scales continuously with training data and inference budget, whereas direct brute-force sampling hits a complexity wall.
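To make the gap between $O(N^k)$ and $O(\log N)$ concrete, the following is a minimal sketch of the two cost models under stated assumptions. It is not the MOOSE-Star implementation. The function names, the balanced-tree layout of the knowledge base, and the `branching` parameter are hypothetical and chosen only for illustration. Brute force is modeled as enumerating every ordered choice of $k$ inspirations from $N$ items. The hierarchical case is modeled as the best case, in which each of the $k$ retrievals descends a balanced tree and compares only the children of one node per level.

```python
def brute_force_candidates(n: int, k: int) -> int:
    """Ordered ways to pick k inspirations from n items, i.e. n**k."""
    return n ** k


def tree_depth(n: int, branching: int) -> int:
    """Levels in a balanced tree with the given branching factor and n leaves.

    Uses integer arithmetic to avoid floating-point error in logarithms.
    """
    if n < 1 or branching < 2:
        raise ValueError("need n >= 1 and branching >= 2")
    depth, capacity = 0, 1
    while capacity < n:
        capacity *= branching
        depth += 1
    return depth


def hierarchical_comparisons(n: int, k: int, branching: int = 10) -> int:
    """Best-case comparisons (assumed model): k independent descents,
    each scoring `branching` children at every level of the tree."""
    return k * branching * tree_depth(n, branching)


if __name__ == "__main__":
    # Illustrative sizes only; these are not figures from the paper.
    n, k = 10**6, 3
    print("brute force candidates:", brute_force_candidates(n, k))
    print("hierarchical comparisons:", hierarchical_comparisons(n, k))
```

Under this toy model the brute-force count grows as a power of $N$, while the hierarchical count grows only with the depth of the tree. The real framework additionally relies on motivation-guided pruning and bounded composition, which this sketch does not model.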