Surrogate models are often used as computationally efficient approximations to complex simulation models, enabling tasks such as solving inverse problems, sensitivity analysis, and probabilistic forward predictions, which would otherwise be computationally infeasible. During training, surrogate parameters are fitted such that the surrogate reproduces the simulation model's outputs as closely as possible. However, the simulation model itself is merely a simplification of the real-world system, often missing relevant processes or suffering from misspecifications, e.g., in inputs or boundary conditions. Hints about these deficiencies may be captured in real-world measurement data, yet we typically ignore them during surrogate building. In this paper, we propose two novel probabilistic approaches to integrate simulation data and real-world measurement data during surrogate training. The first method trains a separate surrogate model for each data source and combines their predictive distributions, while the second incorporates both data sources by training a single surrogate. Both hybrid modeling approaches employ a novel weighting strategy for combining heterogeneous data sources during surrogate training, which operates independently of the chosen surrogate family. We show the conceptual differences and benefits of the two approaches through both synthetic and real-world case studies. The results demonstrate the potential of these methods to improve predictive accuracy and predictive coverage, and to diagnose problems in the underlying simulation model. These insights can improve system understanding and future model development.
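To make the first approach concrete, the following minimal sketch combines the predictive distributions of two separately trained surrogates at a single input point. It rests on assumptions that are not taken from the paper: each surrogate is assumed to return a Gaussian predictive distribution, the combination is assumed to be a two-component mixture, and the scalar weight `w_obs` is a hypothetical fixed value standing in for the weighting strategy, which the abstract does not specify. The function name `combine_predictions` is likewise illustrative.

```python
import math


def combine_predictions(mu_sim, sd_sim, mu_obs, sd_obs, w_obs):
    """Mean and standard deviation of a two-component Gaussian mixture.

    mu_sim, sd_sim: predictive mean/std of the surrogate trained on simulation data
    mu_obs, sd_obs: predictive mean/std of the surrogate trained on measurement data
    w_obs:          assumed mixture weight in [0, 1] for the measurement-based surrogate
    """
    if not 0.0 <= w_obs <= 1.0:
        raise ValueError("w_obs must lie in [0, 1]")
    w_sim = 1.0 - w_obs

    # The mixture mean is the weighted average of the component means.
    mean = w_sim * mu_sim + w_obs * mu_obs

    # The mixture second moment is the weighted average of the component
    # second moments; subtracting mean**2 gives the mixture variance.
    second_moment = (w_sim * (sd_sim**2 + mu_sim**2)
                     + w_obs * (sd_obs**2 + mu_obs**2))
    variance = second_moment - mean**2
    return mean, math.sqrt(variance)


# Hypothetical predictions at one input point (illustrative values only).
mean, sd = combine_predictions(mu_sim=0.0, sd_sim=1.0,
                               mu_obs=2.0, sd_obs=1.0, w_obs=0.5)
print(mean, sd)
```

In this sketch, disagreement between the two surrogates inflates the combined predictive spread: the mixture variance exceeds the weighted average of the component variances whenever the component means differ. This is one simple way in which a mismatch between simulation and measurements could become visible in the combined prediction.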