Out-of-distribution (OOD) generalization is a complicated problem due to the idiosyncrasies of possible distribution shifts between training and test domains. Most benchmarks employ diverse datasets to address this issue; however, the degree of the distribution shift between the training domains and the test domains of each dataset remains largely fixed. This may lead to biased conclusions that either underestimate or overestimate the actual OOD performance of a model. Our study delves into a more nuanced evaluation setting that covers a broad range of shift degrees. We show that the robustness of models can be quite brittle and inconsistent under different degrees of distribution shift, and therefore one should be more cautious when drawing conclusions from evaluations under a limited range of shift degrees. In addition, we observe that large-scale pre-trained models, such as CLIP, are sensitive to even minute distribution shifts in novel downstream tasks. This indicates that while pre-trained representations may help improve downstream in-distribution performance, they could have minimal or even adverse effects on generalization in certain OOD scenarios of the downstream task if not used properly. In light of these findings, we encourage future research to conduct evaluations across a broader range of shift degrees whenever possible.