The expected loss is an upper bound on the model generalization error that admits robust PAC-Bayes bounds for learning. However, loss minimization is known to ignore misspecification, where models cannot exactly reproduce observations. This leads to significant underestimates of parameter uncertainties in the large-data, or underparametrized, limit. We analyze the generalization error of near-deterministic, misspecified, and underparametrized surrogate models, a regime of broad relevance in science and engineering. We show that posterior distributions must cover every training point to avoid a divergent generalization error, and we derive an ensemble \textit{ansatz} that respects this constraint, which for linear models incurs minimal overhead. The efficient approach is demonstrated on model problems before application to high-dimensional datasets in atomistic machine learning. Parameter uncertainties from misspecification survive in the underparametrized limit, giving accurate prediction and bounding of test errors.
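The claim that loss minimization underestimates parameter uncertainty under misspecification can be illustrated with a small numerical sketch. The example below is not the ensemble \textit{ansatz} of this work; it is a hypothetical toy setup, assuming a noise-free quadratic target fitted by a straight line and a standard least-squares parameter covariance. The function name `fit_misspecified` and all settings are illustrative choices, not taken from the paper.

```python
import numpy as np


def fit_misspecified(n_points, seed=0):
    """Fit a line to deterministic quadratic data (a misspecified model).

    Returns the mean parameter-induced prediction std, the training RMSE,
    and the fraction of training points within two parameter stds.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n_points)
    y = x**2  # noise-free target that a linear model cannot reproduce

    # Design matrix for the linear model y ~ a + b * x
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ theta

    # Standard least-squares covariance: residual variance times (X^T X)^{-1}
    sigma2 = resid @ resid / (n_points - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)

    # Prediction std due to parameter uncertainty only, at each training point
    pred_std = np.sqrt(np.einsum("ij,jk,ik->i", X, cov, X))

    return {
        "param_std": float(pred_std.mean()),
        "rmse": float(np.sqrt(np.mean(resid**2))),
        "coverage": float(np.mean(np.abs(resid) <= 2.0 * pred_std)),
    }


if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, fit_misspecified(n))
```

In this toy setting the parameter-induced prediction spread shrinks roughly as the inverse square root of the number of points, while the model error stays essentially constant, so ever fewer training points are covered by the parameter uncertainty. This is the failure mode that the covering constraint on the posterior is meant to avoid.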