Automated methods for discovering mechanistic simulator models from observational data offer a promising path toward accelerating scientific progress. Such methods often take the form of agentic, iterative workflows that repeatedly propose and revise candidate models, imitating the human discovery process. However, existing LLM-based approaches typically implement these workflows through hand-crafted heuristic procedures, without an explicit probabilistic formulation. We recast model discovery as probabilistic inference, i.e., as sampling from an unknown distribution over mechanistic models capable of explaining the data. This perspective provides a unified way to reason about model proposal, refinement, and selection within a single inference framework. As a concrete instantiation of this view, we introduce ModelSMC, an algorithm based on Sequential Monte Carlo sampling. ModelSMC represents candidate models as particles that are iteratively proposed and refined by an LLM and weighted using likelihood-based criteria. Experiments on real-world scientific systems illustrate that this formulation discovers models with interpretable mechanisms and yields improved posterior predictive checks. More broadly, this perspective provides a probabilistic lens for understanding and developing LLM-based approaches to model discovery.
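To make the propose, weight, and refine loop concrete, the following is a minimal sketch of a generic likelihood-tempered Sequential Monte Carlo sampler. It is not ModelSMC itself: a "model" is reduced here to a single scalar parameter of a Gaussian, and the hypothetical `refine` function uses a random-walk Metropolis move as a stand-in for the LLM-driven refinement step. All function names, the tempering schedule, and the toy likelihood are assumptions made for illustration only.

```python
import math
import random


def log_likelihood(theta, data, sigma=1.0):
    """Gaussian log-likelihood of the data under a toy 'model' with mean theta."""
    const = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((y - theta) / sigma) ** 2 - const for y in data)


def log_prior(theta, scale=10.0):
    """Broad zero-mean Gaussian prior (unnormalized) over the toy parameter."""
    return -0.5 * (theta / scale) ** 2


def refine(theta, data, beta, rng, step=0.5):
    """One Metropolis move that leaves prior * likelihood**beta invariant.

    Hypothetical stand-in for an LLM proposing a revised candidate model.
    """
    cand = theta + rng.gauss(0.0, step)
    log_accept = (log_prior(cand) + beta * log_likelihood(cand, data)
                  - log_prior(theta) - beta * log_likelihood(theta, data))
    accept_prob = math.exp(min(0.0, log_accept))
    return cand if rng.random() < accept_prob else theta


def smc(data, n_particles=200, n_steps=10, seed=0):
    """Tempered SMC: move particles from the prior toward the posterior."""
    rng = random.Random(seed)
    # Initial proposal: draw particles from the prior (scale matches log_prior).
    particles = [rng.gauss(0.0, 10.0) for _ in range(n_particles)]
    betas = [t / n_steps for t in range(n_steps + 1)]
    for prev, beta in zip(betas, betas[1:]):
        # Weight: incremental importance weight is likelihood**(beta - prev).
        log_w = [(beta - prev) * log_likelihood(p, data) for p in particles]
        max_log_w = max(log_w)  # subtract the max for numerical stability
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        # Resample: multinomial resampling in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # Refine: perturb each surviving particle with a target-preserving move.
        particles = [refine(p, data, beta, rng) for p in particles]
    return particles


if __name__ == "__main__":
    observations = [1.5, 2.5, 2.0, 1.8, 2.2] * 4
    final_particles = smc(observations)
    print(sum(final_particles) / len(final_particles))
```

In this sketch the final particle set approximates the posterior over the toy parameter. In an LLM-based setting, the particles would presumably be simulator programs rather than scalars, and the refinement step would come from model-generated edits, which may not admit the exact acceptance rule used above.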