Spurious correlations arise when a model learns unreliable features from the data, and they are a well-known drawback of data-driven learning. Although several algorithms have been proposed to mitigate them, the indicators of spurious correlations have yet to be derived jointly. As a result, solutions built on standalone hypotheses fail to beat simple empirical risk minimization (ERM) baselines. We collect some of the commonly studied hypotheses behind the occurrence of spurious correlations and investigate their influence on standard ERM baselines using synthetic datasets generated from causal graphs. From these experiments, we observe patterns connecting the hypotheses to model design choices.
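To make the setting concrete, the following is a minimal sketch of how a synthetic dataset might be drawn from a small causal graph and fed to an ERM baseline. It is not the setup used in this work: the two-feature graph (label Y causing a noisy core feature and a cleaner spurious feature), the noise levels, the agreement probabilities, the flipped test distribution, and the helper names `sample_dataset`, `fit_erm_logistic`, and `accuracy` are all assumptions chosen for illustration.

```python
import numpy as np


def sample_dataset(n, spurious_agreement, rng, core_noise=1.0, spur_noise=0.3):
    """Sample (X, y) from a toy causal graph: Y -> X_core and Y -> X_spur.

    `spurious_agreement` is the probability that the sign of the spurious
    feature matches the label. All numeric defaults are illustrative
    assumptions, not values taken from the paper.
    """
    y = rng.integers(0, 2, size=n)           # binary label in {0, 1}
    sign = 2 * y - 1                          # label mapped to {-1, +1}
    # Core feature: always aligned with the label, but noisy.
    x_core = sign + core_noise * rng.standard_normal(n)
    # Spurious feature: aligned with the label only with the given probability.
    agree = rng.random(n) < spurious_agreement
    x_spur = np.where(agree, sign, -sign) + spur_noise * rng.standard_normal(n)
    return np.column_stack([x_core, x_spur]), y


def fit_erm_logistic(X, y, lr=0.1, steps=2000):
    """Plain ERM: minimize the average logistic loss by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1 | x)
        grad = p - y                             # d(loss)/d(logit) per sample
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def accuracy(w, b, X, y):
    """Fraction of samples whose thresholded logit matches the label."""
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))


rng = np.random.default_rng(0)
# Training data: the spurious feature agrees with the label 95% of the time.
X_train, y_train = sample_dataset(5000, 0.95, rng)
# Shifted test data: the spurious relationship is reversed.
X_test, y_test = sample_dataset(5000, 0.05, rng)

w, b = fit_erm_logistic(X_train, y_train)
train_acc = accuracy(w, b, X_train, y_train)
test_acc = accuracy(w, b, X_test, y_test)
print(f"weights (core, spurious): {w}")
print(f"train accuracy: {train_acc:.3f}, shifted test accuracy: {test_acc:.3f}")
```

In this toy configuration the ERM model places substantial weight on the spurious feature, so accuracy falls once the spurious relationship is reversed at test time. Varying `spurious_agreement`, `core_noise`, or `spur_noise` is one simple way to probe how individual data properties affect the baseline.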