Bayesian Surrogate Training on Multiple Data Sources: A Hybrid Modeling Strategy

Surrogate models are often used as computationally efficient approximations to complex simulation models, enabling tasks such as solving inverse problems, sensitivity analysis, and probabilistic forward prediction, which would otherwise be computationally infeasible. During training, surrogate parameters are fitted such that the surrogate reproduces the simulation model's outputs as closely as possible. However, the simulation model itself is merely a simplification of the real-world system, often missing relevant processes or suffering from misspecifications, e.g., in inputs or boundary conditions. Hints about these deficiencies might be captured in real-world measurement data, and yet we typically ignore those hints during surrogate building. In this paper, we propose two novel probabilistic approaches to integrate simulation data and real-world measurement data during surrogate training. The first method trains a separate surrogate model for each data source and combines their predictive distributions, while the second incorporates both data sources by training a single surrogate. Both hybrid modeling approaches employ a novel weighting strategy for combining heterogeneous data sources during surrogate training, which operates independently of the chosen surrogate family. We show the conceptual differences and benefits of the two approaches through both synthetic and real-world case studies. The results demonstrate the potential of these methods to improve predictive accuracy and predictive coverage, and to diagnose problems in the underlying simulation model. These insights can improve system understanding and future model development.
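To make the first approach concrete, the following is a minimal sketch of combining the predictive distributions of two separately trained surrogates. It is not the weighting strategy proposed in the paper, which the abstract does not specify. It assumes each surrogate returns a Gaussian predictive mean and standard deviation, and that a fixed scalar weight `w` in [0, 1] is given; the function name `mix_gaussian_predictions` and all parameter names are hypothetical.

```python
import numpy as np


def mix_gaussian_predictions(mu_sim, sd_sim, mu_real, sd_real, w):
    """Mean and standard deviation of the two-component mixture
    w * N(mu_sim, sd_sim^2) + (1 - w) * N(mu_real, sd_real^2).

    Assumption: `w` is a fixed, user-supplied weight on the
    simulation-trained surrogate; how to choose it is not covered here.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    mu_sim = np.asarray(mu_sim, dtype=float)
    sd_sim = np.asarray(sd_sim, dtype=float)
    mu_real = np.asarray(mu_real, dtype=float)
    sd_real = np.asarray(sd_real, dtype=float)

    # Mixture mean is the weighted average of the component means.
    mean = w * mu_sim + (1.0 - w) * mu_real
    # Mixture variance via the second moment: E[y^2] - (E[y])^2.
    second_moment = (w * (sd_sim**2 + mu_sim**2)
                     + (1.0 - w) * (sd_real**2 + mu_real**2))
    var = second_moment - mean**2
    return mean, np.sqrt(var)


# Hypothetical predictions at two input points from each surrogate.
mean, sd = mix_gaussian_predictions(
    mu_sim=[0.0, 1.0], sd_sim=[1.0, 0.5],
    mu_real=[2.0, 1.0], sd_real=[1.0, 0.5],
    w=0.5,
)
```

Where the two surrogates disagree, the mixture standard deviation grows beyond that of either component, which is one simple way a combined prediction can reflect a mismatch between simulation and measurements.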

