Several studies have compared the in-distribution (ID) and out-of-distribution (OOD) performance of models in computer vision and NLP. They report a frequent positive correlation, and some, surprisingly, never observe an inverse correlation indicative of a necessary trade-off. Whether inverse patterns can occur matters for determining whether ID performance can serve as a proxy for OOD generalization capabilities. This paper shows with multiple datasets that inverse correlations between ID and OOD performance do happen in real-world data, not only in theoretical worst-case settings. We also explain theoretically how these cases can arise even in a minimal linear setting, and why past studies could miss such cases due to a biased selection of models. Our observations lead to recommendations that contradict those found in much of the current literature.

- High OOD performance sometimes requires trading off ID performance.
- Focusing on ID performance alone may not lead to optimal OOD performance. It may produce diminishing (eventually negative) returns in OOD performance.
- In these cases, studies on OOD generalization that use ID performance for model selection (a commonly recommended practice) will necessarily miss the best-performing models, making these studies blind to a whole range of phenomena.
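To make the idea of an inverse ID/OOD correlation in a linear setting concrete, here is a minimal sketch. It is not the construction used in the paper; it assumes a hypothetical two-feature Gaussian toy problem in which one feature is robust and the other is spurious, with the spurious feature's correlation with the label flipping sign out of distribution. The names `accuracy`, `spurious_sign`, `thetas`, and `pearson` are illustrative assumptions. Accuracies are computed in closed form, so no sampling is involved.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def accuracy(theta, spurious_sign, sigma=1.0):
    """Exact accuracy of the classifier sign(w1*x1 + w2*x2), w = (cos theta, sin theta).

    Assumed toy data (hypothetical, for illustration only):
      y  in {-1, +1}, equally likely
      x1 = y + N(0, sigma^2)                  # robust feature
      x2 = spurious_sign * y + N(0, sigma^2)  # spurious feature
    spurious_sign is +1 in distribution and -1 out of distribution.
    """
    w1, w2 = math.cos(theta), math.sin(theta)
    # y * (w . x) is Gaussian with the mean and standard deviation below;
    # the prediction is correct exactly when this margin is positive.
    mean_margin = w1 + spurious_sign * w2
    std_margin = sigma * math.hypot(w1, w2)
    return normal_cdf(mean_margin / std_margin)


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# A family of linear models that put increasing weight on the spurious feature.
thetas = [i * (math.pi / 4) / 10 for i in range(11)]
id_acc = [accuracy(t, +1) for t in thetas]
ood_acc = [accuracy(t, -1) for t in thetas]

for t, a_id, a_ood in zip(thetas, id_acc, ood_acc):
    print(f"theta={t:.3f}  ID acc={a_id:.3f}  OOD acc={a_ood:.3f}")
print(f"Pearson correlation (ID vs. OOD): {pearson(id_acc, ood_acc):.3f}")
```

Under these assumptions, every step toward the spurious feature raises ID accuracy and lowers OOD accuracy, so selecting the model with the best ID accuracy in this family picks the one with the worst OOD accuracy. Whether real model families behave this way depends on the data and on which models are included in the comparison.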