Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?

Spurious correlations are unstable statistical associations that hinder robust decision-making. Conventional wisdom suggests that models relying on such correlations will fail to generalize out-of-distribution (OOD), especially under strong distribution shifts. However, empirical evidence challenges this view: naive in-distribution empirical risk minimizers often achieve the best OOD accuracy across popular OOD generalization benchmarks. In light of these results, we propose a different perspective: many widely used benchmarks for evaluating robustness to spurious correlations are misspecified. Specifically, they fail to include shifts in spurious correlations that meaningfully impact OOD generalization, making them unsuitable for evaluating the benefit of removing such correlations. We establish conditions under which a distribution shift can reliably assess a model's reliance on spurious correlations. Crucially, under these conditions, we should not observe a strong positive correlation between in-distribution and OOD accuracy, a pattern often called "accuracy on the line." Yet most state-of-the-art benchmarks exhibit this pattern, suggesting that they do not effectively assess robustness. Our findings expose a key limitation in current benchmarks used to evaluate domain generalization algorithms, that is, models designed to avoid spurious correlations. We highlight the need to rethink how robustness to spurious correlations is assessed, identify well-specified benchmarks the field should prioritize, and enumerate strategies for designing future benchmarks that meaningfully reflect robustness under distribution shift.
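As a rough illustration of the "accuracy on the line" pattern described above, the following minimal Python sketch computes the Pearson correlation between in-distribution and OOD accuracy across a set of models. It is not the paper's procedure: the function names, the 0.9 threshold, and the accuracy values are all hypothetical and serve only to show what a strong positive correlation looks like next to a negative one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def shows_accuracy_on_the_line(id_acc, ood_acc, threshold=0.9):
    """Flag a strong positive ID/OOD correlation.

    The threshold is an arbitrary, hypothetical cut-off chosen for
    illustration; it is not taken from the paper.
    """
    return pearson(id_acc, ood_acc) >= threshold


# Made-up accuracies for five hypothetical models (illustration only).
id_acc = [0.70, 0.75, 0.80, 0.85, 0.90]
ood_on_line = [0.50, 0.56, 0.61, 0.66, 0.72]   # OOD accuracy rises with ID accuracy
ood_off_line = [0.60, 0.55, 0.52, 0.46, 0.40]  # OOD accuracy falls as ID accuracy rises

print(shows_accuracy_on_the_line(id_acc, ood_on_line))
print(shows_accuracy_on_the_line(id_acc, ood_off_line))
```

Under the abstract's argument, a benchmark whose models resemble the first toy series would be a candidate for misspecification, whereas the second series is the kind of pattern that is at least compatible with a shift that penalizes reliance on spurious correlations.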
