Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

Scientific modeling faces a tradeoff between the interpretability of mechanistic theory and the predictive power of machine learning. Existing hybrid approaches have made progress by incorporating domain knowledge into machine learning methods as functional constraints, but they depend on precise mathematical specifications. When the underlying equations are partially unknown or misspecified, enforcing rigid constraints can introduce bias and hinder a model's ability to learn from data. We introduce Simulation-Grounded Neural Networks (SGNNs), a framework that incorporates scientific theory by using mechanistic simulations as training data for neural networks. By pretraining on diverse synthetic corpora that span multiple model structures and include realistic observational noise, SGNNs internalize the underlying dynamics of a system as a structural prior. We evaluated SGNNs across multiple disciplines, including epidemiology, ecology, social science, and chemistry. In forecasting tasks, SGNNs outperformed both standard data-driven baselines and physics-constrained hybrid models. They nearly tripled the average forecasting skill of CDC models in COVID-19 mortality forecasting and accurately forecasted high-dimensional ecological systems. SGNNs were also robust to model misspecification, performing well even when pretrained on simulations built on incorrect assumptions. Our framework further introduces back-to-simulation attribution, a method for mechanistic interpretability that explains real-world dynamics by identifying their most similar counterparts within the simulated corpus. By unifying these techniques in a single framework, we demonstrate that diverse mechanistic simulations can serve as effective training data for robust scientific inference.
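
As a rough illustration of two ideas in the abstract, the sketch below builds a small synthetic corpus from a mechanistic simulator with observational noise, then performs a nearest-neighbor form of back-to-simulation attribution. Everything in it is an assumption made for illustration and is not taken from the paper: the discrete-time SIR simulator, the parameter ranges, the log-normal noise model, and the function names are hypothetical. No neural network is trained here, only one model structure is sampled, and similarity is measured as Euclidean distance on peak-normalized curves as a simple stand-in for whatever representation the SGNN framework actually uses.

```python
import math
import random


def simulate_sir(beta, gamma, days=60, population=10_000, initial_infected=10):
    """Discrete-time SIR model (illustrative choice); returns daily new infections."""
    s, i = population - initial_infected, initial_infected
    incidence = []
    for _ in range(days):
        new_infections = min(s, beta * s * i / population)
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        incidence.append(new_infections)
    return incidence


def add_observation_noise(series, rng, report_rate=0.8, sigma=0.1):
    """Assumed noise model: under-reporting times multiplicative log-normal noise."""
    return [x * report_rate * math.exp(rng.gauss(0.0, sigma)) for x in series]


def build_corpus(n_simulations, rng):
    """Sample mechanistic parameters and store each noisy curve with its labels."""
    corpus = []
    for _ in range(n_simulations):
        beta = rng.uniform(0.15, 0.6)    # hypothetical transmission-rate range
        gamma = rng.uniform(0.05, 0.25)  # hypothetical recovery-rate range
        curve = add_observation_noise(simulate_sir(beta, gamma), rng)
        corpus.append({"beta": beta, "gamma": gamma, "curve": curve})
    return corpus


def normalize(series):
    """Scale a curve by its peak so that shape, not magnitude, drives similarity."""
    peak = max(series) or 1.0
    return [x / peak for x in series]


def attribute(observed, corpus, k=3):
    """Return the k simulated entries whose curves are closest to the observed one."""
    target = normalize(observed)

    def distance(entry):
        return math.dist(target, normalize(entry["curve"]))

    return sorted(corpus, key=distance)[:k]


rng = random.Random(0)
corpus = build_corpus(500, rng)

# Stand-in for a real-world series: a held-out simulation with known parameters.
observed = add_observation_noise(simulate_sir(beta=0.4, gamma=0.1), rng)

neighbors = attribute(observed, corpus)
for entry in neighbors:
    print(f"beta={entry['beta']:.3f}  gamma={entry['gamma']:.3f}")
```

The mechanistic parameters attached to the retrieved neighbors act as the explanation for the observed curve. In a fuller pipeline, the same corpus would also serve as pretraining data for a forecasting network, which this sketch omits.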
