Multifidelity linear regression for scientific machine learning from scarce data

Machine learning (ML) methods, which fit the parameters of a given parameterized model class to data, have garnered significant interest as potential methods for learning surrogate models for complex engineering systems for which traditional simulation is expensive. However, in many scientific and engineering settings, generating high-fidelity data on which to train ML models is expensive, and the available budget for generating training data is limited. ML models trained on the resulting scarce high-fidelity data have high variance and are sensitive to vagaries of the training data set. We propose a new multifidelity training approach for scientific machine learning that exploits the scientific context where data of varying fidelities and costs are available; for example, high-fidelity data may be generated by an expensive, fully resolved physics simulation, whereas lower-fidelity data may arise from a cheaper model based on simplifying assumptions. We use the multifidelity data to define new multifidelity Monte Carlo estimators for the unknown parameters of linear regression models, and provide theoretical analyses that guarantee the approach's accuracy and improved robustness to small training budgets. Numerical results verify the theoretical analysis and demonstrate that multifidelity learned models trained on scarce high-fidelity data and additional low-fidelity data achieve order-of-magnitude lower model variance than standard models trained on only high-fidelity data of comparable cost. This illustrates that in the scarce data regime, our multifidelity training strategy yields models with lower expected error than standard training approaches.
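
To make the idea of a multifidelity Monte Carlo estimator for regression parameters concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a generic control-variate construction: the cross-moment E[phi(x) f(x)] in the least-squares normal equations is estimated from a few high-fidelity samples and corrected with many low-fidelity samples. The feature map `features`, the toy models `f_hi` and `f_lo`, the fixed weight `alpha=1.0`, and the sample sizes are all hypothetical choices made for illustration; cost accounting, optimal sample allocation, and the choice of control-variate weight are not modeled here.

```python
import numpy as np

def features(x):
    # Hypothetical feature map: intercept plus linear term.
    return np.column_stack([np.ones_like(x), x])

def f_hi(x):
    # Toy stand-in for an expensive high-fidelity model (assumption).
    return np.sin(2.0 * x) + 0.5 * x

def f_lo(x):
    # Toy stand-in for a cheap model that closely tracks f_hi (assumption).
    return np.sin(2.0 * x) + 0.5 * x + 0.05 * np.cos(3.0 * x)

def fit_hf_only(x_hi, y_hi):
    # Standard least squares using only the scarce high-fidelity data.
    return np.linalg.lstsq(features(x_hi), y_hi, rcond=None)[0]

def fit_multifidelity(x_hi, y_hi, x_lo, y_lo, alpha=1.0):
    # Assumes nested samples: the first len(x_hi) entries of x_lo are x_hi.
    n, m = len(x_hi), len(x_lo)
    phi_hi, phi_lo = features(x_hi), features(x_lo)
    # Monte Carlo estimates of the cross-moment E[phi(x) f(x)].
    c_hi = phi_hi.T @ y_hi / n            # high fidelity, few samples
    c_lo_small = phi_hi.T @ y_lo[:n] / n  # low fidelity, same few samples
    c_lo_large = phi_lo.T @ y_lo / m      # low fidelity, many samples
    # Control-variate correction of the high-fidelity estimate.
    c_mf = c_hi + alpha * (c_lo_large - c_lo_small)
    # Gram matrix E[phi phi^T] needs inputs only, so use every sample.
    gram = phi_lo.T @ phi_lo / m
    return np.linalg.solve(gram, c_mf)

def compare_variance(n_hi=10, n_lo=1000, trials=200, seed=0):
    # Refit both estimators on fresh random data sets and return the total
    # coefficient variance of each across the repeated fits.
    rng = np.random.default_rng(seed)
    betas_hf, betas_mf = [], []
    for _ in range(trials):
        x_lo = rng.uniform(-1.0, 1.0, size=n_lo)
        x_hi = x_lo[:n_hi]
        y_hi = f_hi(x_hi)
        betas_hf.append(fit_hf_only(x_hi, y_hi))
        betas_mf.append(fit_multifidelity(x_hi, y_hi, x_lo, f_lo(x_lo)))
    var_hf = np.var(np.array(betas_hf), axis=0).sum()
    var_mf = np.var(np.array(betas_mf), axis=0).sum()
    return var_hf, var_mf

if __name__ == "__main__":
    var_hf, var_mf = compare_variance()
    print(f"high-fidelity only: {var_hf:.3e}   multifidelity: {var_mf:.3e}")
```

In this toy setting the low-fidelity model is nearly identical to the high-fidelity one, so the correction term removes most of the sampling noise in the cross-moment. How much variance reduction is achieved in practice depends on how well the two models correlate and on their relative costs, which is what the paper's analysis addresses.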

