A fundamental challenge in science and engineering is the simulation-to-experiment gap. We often know the governing physical laws, but for complex systems they can be too difficult to solve exactly. Such systems are therefore commonly modeled with simulators, which introduce computational approximations. Experimental measurements represent the real world more faithfully, but they typically consist of observations that only partially reflect the system's full underlying state. We propose a data-driven distribution alignment framework that bridges this gap by pre-training a generative model on fully observed (but imperfect) simulation data, then aligning it with partial (but real) experimental observations. While our method is domain-agnostic, we ground it in the physical sciences by introducing Adversarial Distribution Alignment (ADA), which aligns a generative model of atomic positions, initially trained on a simulated Boltzmann distribution, with the distribution of experimental observations. We prove that our method recovers the target observable distribution, even with multiple, potentially correlated observables. We also empirically validate our framework on synthetic, molecular, and experimental protein data, demonstrating that it can align generative models with diverse observables. Our code is available at https://kaityrusnelson.com/ada/.
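To make the alignment idea concrete, one generic way to write an adversarial objective of this kind is sketched below. The notation is assumed for illustration and is not taken from the paper: $p_\theta(x)$ is the generative model over full states $x$ (pre-trained on simulation data), $g$ is a hypothetical map from a full state to its observable $y = g(x)$, $q_{\mathrm{exp}}(y)$ is the distribution of experimental observations, and $D_\phi$ is a discriminator acting only on observables. The actual ADA objective may differ, for example by including a term that keeps $p_\theta$ close to the pre-trained model.

```latex
% Illustrative GAN-style alignment objective (assumed notation, not the paper's):
% the discriminator D_phi sees only observables, never full states.
\min_{\theta} \max_{\phi} \;
  \mathbb{E}_{y \sim q_{\mathrm{exp}}} \big[ \log D_\phi(y) \big]
  + \mathbb{E}_{x \sim p_\theta} \big[ \log \big( 1 - D_\phi(g(x)) \big) \big]

% For the optimal discriminator, this reduces (up to constants) to the
% Jensen-Shannon divergence between q_exp and the pushforward of p_theta under g:
%   2 \, \mathrm{JS}\big( q_{\mathrm{exp}} \,\|\, g_{\#} p_\theta \big) - \log 4,
% which is minimized exactly when g_{\#} p_\theta = q_{\mathrm{exp}}.
```

Under these assumptions, only the observable marginal $g_{\#} p_\theta$ is constrained by experimental data, while the pre-trained model supplies the structure over the unobserved parts of the state.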