We study out-of-distribution (o.o.d.) generalization in the setting where spurious correlations between attributes vary across training and test domains. This setting, known as correlation shift, raises concerns about the reliability of machine learning. In this work, we bring the concepts of direct and indirect effects from causal inference to the domain generalization problem. We argue that models that learn direct effects minimize the worst-case risk across correlation-shifted domains. To eliminate the indirect effects, our algorithm proceeds in two stages. In the first stage, we learn an indirect-effect representation by minimizing the error of predicting domain labels from the representation and the class label. In the second stage, we remove the indirect effects learned in the first stage by matching each sample with another sample that has a similar indirect-effect representation but a different class label. We also propose a new model selection method that matches the validation set in the same way, and we show that it improves the generalization performance of existing models on correlation-shifted datasets. Experiments on five correlation-shifted datasets and the DomainBed benchmark verify the effectiveness of our approach.
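To make the second-stage matching step concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes the indirect-effect representations from the first stage are already available as fixed vectors, and it uses nearest-neighbour search under squared Euclidean distance as the notion of "similar"; the abstract specifies neither choice. The function name `match_across_classes` and the toy inputs are hypothetical.

```python
import numpy as np

def match_across_classes(reps, labels):
    """Return, for each sample, the index of its nearest neighbour in
    representation space among samples with a different class label.

    Assumptions (not stated in the abstract): representations are fixed
    vectors and similarity is squared Euclidean distance.
    """
    reps = np.asarray(reps, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = reps[:, None, :] - reps[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Rule out same-class pairs, which also rules out self-matches.
    same_class = labels[:, None] == labels[None, :]
    if np.any(np.all(same_class, axis=1)):
        raise ValueError("need at least two distinct class labels")
    dists[same_class] = np.inf
    return np.argmin(dists, axis=1)

# Toy data: two clusters in a 1-D representation space, with both
# class labels present in each cluster.
reps = [[0.0], [0.1], [1.0], [1.1]]
labels = [0, 1, 0, 1]
print(match_across_classes(reps, labels))  # [1 0 3 2]
```

In the toy example, each sample is paired with the sample of the opposite class in its own cluster, so the two members of a pair share a similar indirect-effect representation and differ in class label. The same routine could in principle be applied to a validation set for the matching-based model selection described above, again under the stated assumptions.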