Machine learning (ML) methods, which fit the parameters of a given parameterized model class to data, have garnered significant interest as a means of learning surrogate models for complex engineering systems for which traditional simulation is expensive. However, in many scientific and engineering settings, generating high-fidelity data on which to train ML models is expensive and the budget for generating training data is limited, so high-fidelity training data are scarce. ML models trained on scarce data have high variance, resulting in poor expected generalization performance. We propose a new multifidelity training approach for scientific machine learning via linear regression that exploits the scientific context in which data of varying fidelities and costs are available: for example, high-fidelity data may be generated by an expensive, fully resolved physics simulation, whereas lower-fidelity data may arise from a cheaper model based on simplifying assumptions. We use the multifidelity data within an approximate control variate framework to define new multifidelity Monte Carlo estimators for linear regression models. We provide bias and variance analyses of our new estimators that guarantee the approach's accuracy and improved robustness to scarce high-fidelity data. Numerical results demonstrate that our multifidelity training approach achieves accuracy similar to that of the standard high-fidelity-only approach while requiring orders of magnitude less high-fidelity data.
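To make the control variate idea concrete, the following is a minimal sketch of how low-fidelity data might correct a regression fitted on a handful of high-fidelity samples. It is an illustration under stated assumptions, not the estimator analyzed in the paper: the function names (`features`, `multifidelity_fit`), the quadratic feature map, the fixed weight `alpha`, and the toy high- and low-fidelity models are all hypothetical. The sketch assumes the least-squares coefficients are written as the solution of a linear system whose right-hand side is a mean of feature-output products, and it applies the control variate correction to that mean.

```python
import numpy as np


def features(x):
    # Hypothetical feature map: quadratic polynomial in a scalar input.
    return np.column_stack([np.ones_like(x), x, x**2])


def multifidelity_fit(x_all, y_hi, y_lo_all, alpha=1.0):
    """Control-variate-style linear regression (illustrative sketch only).

    x_all    : all sampled inputs; the first len(y_hi) also have HF outputs
    y_hi     : high-fidelity outputs at x_all[:len(y_hi)]
    y_lo_all : low-fidelity outputs at every entry of x_all
    alpha    : control variate weight (assumed fixed here, not optimized)
    """
    n_hi, n_all = len(y_hi), len(x_all)
    phi = features(x_all)

    # The Gram matrix depends on inputs only, so every sample can be used.
    gram = phi.T @ phi / n_all

    # Sample means of feature-output products.
    c_hi = phi[:n_hi].T @ y_hi / n_hi                  # HF, few samples
    c_lo_small = phi[:n_hi].T @ y_lo_all[:n_hi] / n_hi  # LF, same few samples
    c_lo_large = phi.T @ y_lo_all / n_all               # LF, all samples

    # Control variate correction: shift the HF mean by the difference
    # between the large-sample and small-sample LF means.
    c_mf = c_hi + alpha * (c_lo_large - c_lo_small)
    return np.linalg.solve(gram, c_mf)


# Toy usage with made-up models: an "expensive" function and a cheap proxy.
rng = np.random.default_rng(0)
x_all = rng.uniform(-1.0, 1.0, size=2000)
y_hi = np.sin(2.0 * x_all[:10])   # only 10 high-fidelity evaluations
y_lo_all = 2.0 * x_all            # low-fidelity model evaluated everywhere

beta_mf = multifidelity_fit(x_all, y_hi, y_lo_all, alpha=1.0)
beta_no_cv = multifidelity_fit(x_all, y_hi, y_lo_all, alpha=0.0)
```

Setting `alpha=0.0` drops the low-fidelity correction, although the Gram matrix in this sketch still uses all inputs. In a control variate setting the weight is normally chosen from the correlation between the high- and low-fidelity quantities; here it is left as a plain parameter.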