Machine learning (ML) methods, which fit the parameters of a given parameterized model class to data, have garnered significant interest as a means of learning surrogate models for complex engineering systems whose traditional simulation is expensive. However, in many scientific and engineering settings, generating high-fidelity data on which to train ML models is expensive, and the available budget for generating training data is limited. ML models trained on the resulting scarce high-fidelity data have high variance and are sensitive to the vagaries of the training data set. We propose a new multifidelity training approach for scientific machine learning that exploits the scientific context in which data of varying fidelities and costs are available; for example, high-fidelity data may be generated by an expensive, fully resolved physics simulation, whereas lower-fidelity data may arise from a cheaper model based on simplifying assumptions. We use the multifidelity data to define new multifidelity Monte Carlo estimators for the unknown parameters of linear regression models, and we provide theoretical analyses that guarantee the approach's accuracy and improved robustness to small training budgets. Numerical results verify the theoretical analysis and demonstrate that multifidelity learned models trained on scarce high-fidelity data and additional low-fidelity data achieve order-of-magnitude lower model variance than standard models trained on only high-fidelity data of comparable cost. This illustrates that, in the scarce data regime, our multifidelity training strategy yields models with lower expected error than standard training approaches.
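To make the idea concrete, the following is a minimal sketch of one way a multifidelity Monte Carlo estimator for linear regression coefficients could be built. It is not taken from the paper and may differ from the estimators the authors define. It assumes a generic control-variate construction with a fixed weight `alpha`, and the functions `f_hi` and `f_lo`, the features `[1, x]`, the input distribution, and the sample sizes are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def f_hi(x):
    # Hypothetical stand-in for an expensive high-fidelity model.
    return np.exp(0.7 * x)

def f_lo(x):
    # Hypothetical stand-in for a cheap low-fidelity model
    # (truncated Taylor expansion of f_hi).
    return 1.0 + 0.7 * x + 0.245 * x**2

def features(x):
    # Linear regression features [1, x] (an assumption of this sketch).
    return np.column_stack([np.ones_like(x), x])

def fit_hf_only(x_hi):
    # Ordinary least squares on the scarce high-fidelity samples only.
    beta, *_ = np.linalg.lstsq(features(x_hi), f_hi(x_hi), rcond=None)
    return beta

def fit_multifidelity(x_all, n_hi, alpha=1.0):
    # The first n_hi inputs are evaluated at both fidelities;
    # all inputs are evaluated at low fidelity.
    phi_all = features(x_all)
    phi_hi = phi_all[:n_hi]
    x_hi = x_all[:n_hi]
    # Gram matrix estimated from all inputs.
    gram = phi_all.T @ phi_all / len(x_all)
    # Control-variate estimate of E[phi(x) * f_hi(x)]: the scarce
    # high-fidelity mean is corrected by the difference between the
    # large-sample and small-sample low-fidelity means.
    c_hi = phi_hi.T @ f_hi(x_hi) / n_hi
    c_lo_small = phi_hi.T @ f_lo(x_hi) / n_hi
    c_lo_large = phi_all.T @ f_lo(x_all) / len(x_all)
    c = c_hi + alpha * (c_lo_large - c_lo_small)
    return np.linalg.solve(gram, c)

def compare_variance(n_hi=10, n_lo=1000, trials=200, seed=0):
    # Repeat both fits on fresh random inputs and return the total
    # variance of the coefficient estimates across trials.
    rng = np.random.default_rng(seed)
    hf, mf = [], []
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=n_lo)
        hf.append(fit_hf_only(x[:n_hi]))
        mf.append(fit_multifidelity(x, n_hi))
    return np.var(hf, axis=0).sum(), np.var(mf, axis=0).sum()

if __name__ == "__main__":
    hf_var, mf_var = compare_variance()
    print("high-fidelity only variance:", hf_var)
    print("multifidelity variance:", mf_var)
```

In this toy setting both fits use the same number of high-fidelity evaluations, and the low-fidelity evaluations are treated as cheap. A full cost-matched comparison and an optimized choice of `alpha` are outside the scope of the sketch.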