Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

Scientific modeling faces a tradeoff between the interpretability of mechanistic theory and the predictive power of machine learning. Hybrid approaches such as Physics-Informed Neural Networks (PINNs) embed domain knowledge as functional constraints, but they can be brittle under model misspecification. We introduce Simulation-Grounded Neural Networks (SGNNs), a framework that instead embeds domain knowledge in the training data to establish a structural prior. By pretraining on synthetic corpora that span diverse model structures and observational artifacts, SGNNs learn the broad patterns of physical possibility. This allows the model to internalize the underlying dynamics of a system without being forced to satisfy a single, potentially incorrect equation. We evaluated SGNNs across scientific disciplines and found that this approach confers significant robustness. In prediction tasks, SGNNs nearly tripled COVID-19 forecasting skill relative to CDC baselines. In tests on dengue outbreaks, SGNNs outperformed physics-constrained models even when both were restricted to incorrect human-to-human transmission equations, suggesting that SGNNs are more robust to model misspecification. For inference, SGNNs extend the logic of simulation-based inference to enable supervised learning for unobservable targets, and they estimated early COVID-19 transmissibility more accurately than traditional methods. Finally, SGNNs enable back-to-simulation attribution, a form of mechanistic interpretability that maps real-world data back to the simulated manifold to identify underlying processes. By unifying these disparate simulation-based techniques in a single framework, we demonstrate that mechanistic simulations can serve as effective training data for robust scientific inference that generalizes beyond the limitations of fixed functional forms.
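The idea of supervised learning for an unobservable target can be illustrated with a small sketch. The code below is not the SGNN implementation: it assumes a toy discrete-time SIR simulator, hypothetical parameter ranges, and a plain linear least-squares regressor standing in for the neural network. It generates a synthetic corpus in which the reproduction number R0 is known by construction, corrupts the simulated case counts with under-reporting and Poisson noise (a stand-in for "observational artifacts"), and then learns to predict R0 from the observed curve alone.

```python
import numpy as np


def simulate_observed_cases(beta, gamma, report_rate, rng,
                            n_days=30, pop=1_000_000, i0=10):
    """Toy discrete-time SIR model returning noisy, under-reported daily cases.

    All defaults (population, seed infections, horizon) are illustrative
    assumptions, not values taken from the paper.
    """
    s, i = float(pop - i0), float(i0)
    observed = np.empty(n_days)
    for t in range(n_days):
        new_inf = min(beta * s * i / pop, s)  # cannot infect more than remain
        s -= new_inf
        i += new_inf - gamma * i
        # Observational artifact: under-reporting plus Poisson count noise.
        observed[t] = rng.poisson(report_rate * new_inf)
    return observed


def make_corpus(n_sims, rng, n_days=30):
    """Build (features, labels): log case curves paired with the true R0."""
    X = np.empty((n_sims, n_days))
    y = np.empty(n_sims)
    for k in range(n_sims):
        gamma = rng.uniform(0.1, 0.3)    # recovery rate (assumed range)
        r0 = rng.uniform(1.2, 4.0)       # label: never observed in real data
        report = rng.uniform(0.2, 0.8)   # reporting fraction (assumed range)
        cases = simulate_observed_cases(r0 * gamma, gamma, report, rng, n_days)
        X[k] = np.log1p(cases)
        y[k] = r0
    return X, y


def fit_linear(X, y):
    """Least-squares fit with an intercept; stands in for a neural network."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict(w, X):
    return w[0] + X @ w[1:]


rng = np.random.default_rng(0)
X_train, y_train = make_corpus(2000, rng)
X_test, y_test = make_corpus(500, rng)

w = fit_linear(X_train, y_train)
mae_model = np.mean(np.abs(predict(w, X_test) - y_test))
# Baseline: always predict the mean R0 seen in the training corpus.
mae_baseline = np.mean(np.abs(y_train.mean() - y_test))
print(f"held-out MAE: model={mae_model:.2f}, mean baseline={mae_baseline:.2f}")
```

Because the label comes from the simulator rather than from measurement, the same pattern applies to any latent quantity the simulator exposes. A closer analogue of the framework described above would draw simulations from several model structures rather than a single SIR model and would replace the linear fit with a neural network.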
