Neural networks can fail when the data contains spurious correlations. To understand this phenomenon, researchers have proposed numerous spurious-correlation benchmarks on which to evaluate mitigation methods. However, we observe that these benchmarks exhibit substantial disagreement, with the best methods on one benchmark performing poorly on another. We explore this disagreement and examine benchmark validity by defining three desiderata that a benchmark should satisfy in order to meaningfully evaluate methods. Our results have implications for both benchmarks and mitigations: we find that certain benchmarks are not meaningful measures of method performance, and that several methods are not sufficiently robust for widespread use. We present a simple recipe for practitioners: choose methods using the benchmark most similar to their given problem.