Different distribution shifts require different algorithmic and operational interventions, so methodological research must be grounded in the specific shifts it addresses. Although nascent benchmarks provide a promising empirical foundation, they implicitly focus on covariate shifts, and the validity of empirical findings depends on the type of shift: for example, previous observations on algorithmic performance may no longer hold when the $Y|X$ distribution changes. We conduct a thorough investigation of natural shifts in five tabular datasets over 86,000 model configurations and find that $Y|X$-shifts are the most prevalent. To encourage researchers to develop a refined language for distribution shifts, we build WhyShift, an empirical testbed of curated real-world shifts in which we characterize the type of shift over which we benchmark performance. Since $Y|X$-shifts are prevalent in tabular settings, we identify the covariate regions that suffer the biggest $Y|X$-shifts and discuss the implications for algorithmic and data-based interventions. Our testbed highlights the importance of future research that builds an understanding of how distributions differ.
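To make the distinction between the two kinds of shift concrete, the following toy sketch may help. It is not taken from WhyShift and uses no real data: the synthetic one-dimensional data, the helper names `sample` and `accuracy`, and all numeric settings are assumptions chosen purely for illustration. A fixed classifier (predict 1 when x > 0) is scored on a source distribution, on a target where only the distribution of $X$ moves, and on a target where only $Y|X$ moves.

```python
import numpy as np


def sample(n, x_mean, boundary, rng):
    """Draw synthetic (x, y) pairs (hypothetical toy setup).

    x_mean controls the marginal of X; boundary controls Y|X, since
    y = 1 exactly when x - boundary + noise > 0.
    """
    x = rng.normal(x_mean, 1.0, size=n)
    noise = rng.normal(0.0, 0.5, size=n)
    y = (x - boundary + noise > 0).astype(int)
    return x, y


def accuracy(x, y):
    """Accuracy of the fixed rule 'predict 1 when x > 0'."""
    return float(np.mean((x > 0).astype(int) == y))


rng = np.random.default_rng(0)
n = 20_000

# Source: X ~ N(0, 1), decision boundary at 0.
acc_source = accuracy(*sample(n, x_mean=0.0, boundary=0.0, rng=rng))
# Covariate shift: the marginal of X moves, Y|X is unchanged.
acc_x_shift = accuracy(*sample(n, x_mean=1.5, boundary=0.0, rng=rng))
# Y|X shift: the marginal of X is unchanged, the boundary moves.
acc_yx_shift = accuracy(*sample(n, x_mean=0.0, boundary=1.0, rng=rng))

print(f"source:    {acc_source:.3f}")
print(f"X-shift:   {acc_x_shift:.3f}")
print(f"Y|X-shift: {acc_yx_shift:.3f}")
```

In this toy setting the fixed rule loses little or nothing under the pure covariate shift, because the relationship between $X$ and $Y$ that it encodes is still correct, whereas it degrades noticeably once $Y|X$ changes. Real tabular shifts generally mix both effects, which is why characterizing the type of shift matters before choosing an intervention.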