Mixture Data for Training Cannot Ensure Out-of-distribution Generalization

Deep neural networks often generalize poorly to out-of-distribution (OOD) data, and a notable theoretical gap remains between the contributing factors and their respective impacts. Prior work on in-distribution data suggests that the generalization error shrinks as the size of the mixed training data increases. For OOD samples, however, this conventional understanding no longer holds: increasing the size of the training data does not always reduce the test generalization error. In fact, diverse error trends have been observed across shift scenarios, including power-law decreases, initial declines followed by increases, and persistently flat patterns. Previous work has treated OOD data qualitatively, merely as samples unseen during training, which makes these complicated non-monotonic trends hard to explain. In this work, we quantitatively redefine OOD data as data situated outside the convex hull of the mixed training data and establish novel generalization error bounds to better understand these counterintuitive observations. Our proof of the new risk bound shows that the efficacy of well-trained models can be guaranteed for unseen data within the convex hull. More interestingly, for OOD data beyond this coverage, generalization cannot be ensured, which aligns with our observations. Furthermore, we evaluate various OOD techniques to show that our results not only explain insightful observations in recent OOD generalization work, such as the significance of diverse data and the sensitivity of existing algorithms to unseen shifts, but also inspire a novel and effective data selection strategy.
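
To make the convex-hull definition concrete, the following minimal sketch checks whether a test point lies inside the convex hull of a set of training points. It is only an illustration under simplifying assumptions, not the construction used in the paper: each training source is summarized as a hypothetical 2D feature point, and the helper names (`convex_hull`, `in_convex_hull`) are invented for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def in_convex_hull(point: Point, train_points: List[Point], tol: float = 1e-12) -> bool:
    """True if `point` lies inside or on the boundary of the hull of `train_points`."""
    hull = convex_hull(train_points)
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear training points")
    n = len(hull)
    # For a counter-clockwise hull, an inside point is never to the right of any edge.
    return all(cross(hull[i], hull[(i + 1) % n], point) >= -tol for i in range(n))


# Hypothetical 2D summaries of five training sources (assumed values).
train = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 2.0)]
inside = in_convex_hull((1.0, 3.0), train)   # covered by the training mixture
outside = in_convex_hull((5.0, 5.0), train)  # OOD under the convex-hull definition
```

Under this toy reading, `(1.0, 3.0)` is a convex combination of the training points, while `(5.0, 5.0)` falls outside their coverage and would count as OOD. In higher dimensions the same membership question is usually posed as a linear feasibility problem over the convex-combination weights rather than solved with a 2D hull.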
