While machine learning models rapidly advance the state of the art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem because these models are vulnerable to spurious correlations. We propose a balanced mini-batch sampling strategy that transforms a biased data distribution into a spurious-free balanced distribution, based on the invariance of the causal mechanisms underlying the data generation process. We argue that Bayes optimal classifiers trained on such a balanced distribution are minimax optimal across a sufficiently diverse environment space. We also provide an identifiability guarantee for the latent variable model of the proposed data generation process, given enough training environments. Experiments on DomainBed demonstrate empirically that our method obtains the best performance compared with 20 baselines reported on the benchmark.
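To illustrate the general idea of balanced mini-batch sampling, the following is a minimal sketch and not the procedure used in this work, whose details the abstract does not specify. It assumes, purely for illustration, that the spurious attribute is observed and discrete (the hypothetical `groups` argument); the function name `balanced_batch` and its parameters are likewise hypothetical. Each mini-batch draws the same number of examples from every observed (label, group) cell, so the label carries no information about the group within the batch.

```python
import random
from collections import defaultdict


def balanced_batch(labels, groups, batch_size, rng):
    """Return example indices with equal counts per (label, group) cell.

    Assumption: `groups` is an observed, discrete stand-in for the
    spurious attribute. This is an illustrative simplification.
    """
    # Bucket example indices by their (label, group) combination.
    cells = defaultdict(list)
    for idx, (y, g) in enumerate(zip(labels, groups)):
        cells[(y, g)].append(idx)

    keys = sorted(cells)
    per_cell = batch_size // len(keys)
    if per_cell == 0:
        raise ValueError("batch_size is smaller than the number of cells")

    batch = []
    for key in keys:
        # Sample with replacement so that rare cells are up-weighted
        # to the same share of the batch as common ones.
        batch.extend(rng.choices(cells[key], k=per_cell))
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    # Biased toy data: label 0 mostly co-occurs with group 0,
    # label 1 mostly co-occurs with group 1.
    labels = [0] * 90 + [0] * 10 + [1] * 10 + [1] * 90
    groups = [0] * 90 + [1] * 10 + [0] * 10 + [1] * 90
    batch = balanced_batch(labels, groups, batch_size=32, rng=random.Random(0))
    print(len(batch))
```

Sampling with replacement keeps the sketch short; when a cell is very small, the same few examples repeat often, which is one reason a practical method may need something more careful than direct resampling.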