Out-of-Distribution (OOD) generalization, a cornerstone of robust machine learning models that can handle data diverging from the training distribution, remains an open challenge in deep learning. While significant progress has been made in computer vision and natural language processing, the study of OOD generalization on tabular data, which is ubiquitous in industrial applications, remains nascent. To bridge this gap, we present Wild-Tab, a large-scale benchmark tailored for OOD generalization in tabular regression tasks. The benchmark incorporates 3 industrial datasets drawn from fields such as weather prediction and power consumption estimation, providing a challenging testbed for evaluating OOD performance under real-world conditions. Our extensive experiments, which evaluate 10 distinct OOD generalization methods on Wild-Tab, reveal nuanced insights. Many of these methods struggle to maintain high performance on unseen data, with OOD performance dropping markedly relative to in-distribution performance. At the same time, Empirical Risk Minimization (ERM), despite its simplicity, delivers robust performance across all evaluations, rivaling state-of-the-art methods. We hope that the release of Wild-Tab will facilitate further research on OOD generalization and aid the deployment of machine learning models in real-world settings where handling distribution shifts is a crucial requirement.