The Value of Out-of-Distribution Data

From arXiv. Previous versions of this work were presented at the Out-of-Distribution Generalization in Computer Vision (OOD-CV) Workshop (ECCV 2022) and the Workshop on Distribution Shifts (NeurIPS 2022).

We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset.
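The abstract mentions "an appropriately weighted objective of the target and OOD empirical risk" without giving its form. As a minimal sketch, assuming the objective is a convex combination with a mixing weight `alpha` (the function name, the parameter name, and the convex-combination form are illustrative assumptions, not taken from the paper), it could look like this:

```python
import numpy as np

def weighted_risk(target_losses, ood_losses, alpha):
    """Hypothetical weighted objective: alpha * target risk + (1 - alpha) * OOD risk.

    target_losses, ood_losses: per-sample losses for samples known to be
    target or OOD (the idealized setting where OOD identity is known).
    alpha: weight in [0, 1] on the target empirical risk (an assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    target_losses = np.asarray(target_losses, dtype=float)
    ood_losses = np.asarray(ood_losses, dtype=float)
    if target_losses.size == 0:
        raise ValueError("at least one target sample is required")

    # Empirical risk is the mean per-sample loss within each group.
    target_risk = target_losses.mean()
    if ood_losses.size == 0:
        # With no OOD samples the objective reduces to the target risk.
        return target_risk
    ood_risk = ood_losses.mean()
    return alpha * target_risk + (1.0 - alpha) * ood_risk
```

Setting `alpha = 1` ignores the OOD samples entirely. Pooling all samples into one average is a different choice: it weights each group by its sample count, so the OOD term grows as more OOD samples are added.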


