Misspecification uncertainties in near-deterministic regression

The expected loss is an upper bound on the model generalization error, and it admits robust PAC-Bayes bounds for learning. However, loss minimization is known to ignore misspecification, where models cannot exactly reproduce observations. This leads to significant underestimates of parameter uncertainties in the large-data, or underparametrized, limit. We analyze the generalization error of near-deterministic, misspecified and underparametrized surrogate models, a regime of broad relevance in science and engineering. We show that posterior distributions must cover every training point to avoid a divergent generalization error, and we derive an ensemble ansatz that respects this constraint, which for linear models incurs minimal overhead. The efficient approach is demonstrated on model problems before application to high-dimensional datasets in atomistic machine learning. Parameter uncertainties from misspecification survive in the underparametrized limit, giving accurate prediction and bounding of test errors.
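The underestimation described in the abstract can be illustrated with a small numerical sketch. This is not the paper's ensemble ansatz. It is only a toy example under stated assumptions: a quadratic target, a straight-line (misspecified) model, a Gaussian likelihood with a small known noise level, and a flat prior. The function name `posterior_std_and_rmse` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def posterior_std_and_rmse(n_points, noise_std=1e-3, seed=0):
    """Fit a misspecified linear model to near-deterministic quadratic data.

    Returns the mean predictive standard deviation implied by the standard
    Gaussian posterior over parameters, and the root-mean-square residual.
    All modelling choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n_points)
    # Near-deterministic observations of a quadratic target.
    y = x**2 + noise_std * rng.normal(size=n_points)

    # Misspecified model: intercept plus slope, no quadratic term.
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Parameter covariance for a Gaussian likelihood with known noise_std
    # and a flat prior: sigma^2 (X^T X)^{-1}. It shrinks like 1 / n_points.
    cov = noise_std**2 * np.linalg.inv(X.T @ X)

    # Predictive standard deviation from parameter uncertainty alone.
    pred_std = np.sqrt(np.einsum("ij,jk,ik->i", X, cov, X))

    rmse = np.sqrt(np.mean((y - X @ theta) ** 2))
    return pred_std.mean(), rmse


if __name__ == "__main__":
    for n in (100, 10_000):
        std, rmse = posterior_std_and_rmse(n)
        print(f"N={n:6d}  mean predictive std={std:.2e}  RMSE={rmse:.2e}")
```

In this toy setting the parameter uncertainty shrinks as more data are added, while the actual error stays roughly constant because the linear model cannot represent the quadratic target. That gap is the kind of misspecification uncertainty the abstract says loss minimization ignores.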

