An open problem in machine learning is how to prevent models from exploiting spurious correlations in the data; a famous example is the background-label shortcut in the Waterbirds dataset. A common remedy is to train a model across multiple environments; in Waterbirds, this corresponds to randomizing the background during training. However, selecting the right environments is challenging because they are rarely known a priori. We propose Universal Adaptive Environment Discovery (UAED), a unified framework that learns a distribution over data transformations that instantiate environments and optimizes any robust objective averaged over this learned distribution. UAED yields adaptive variants of IRM, REx, GroupDRO, and CORAL without predefined groups or manual environment design. We support the framework theoretically with PAC-Bayes bounds and by showing robustness to test environment distributions under standard conditions. Empirically, UAED discovers interpretable environment distributions and improves worst-case accuracy on standard benchmarks while remaining competitive on mean accuracy. Our results indicate that making environments adaptive is a practical route to out-of-distribution generalization.
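To make the idea of "a robust objective averaged over a learned distribution of environments" more concrete, the following is a minimal sketch and not the UAED implementation. It assumes a toy setting in which each environment is a rescaling of one spurious feature, the environment distribution is a softmax over learnable logits, and the robust objective is a REx-style penalty (weighted mean risk plus a multiple of the weighted variance of risks). All names (`softmax`, `env_risk`, `adaptive_rex_objective`, `spurious_scale`, `env_logits`, `beta`) are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Turn unnormalized logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def env_risk(weights, data, spurious_scale):
    """Mean squared error of a linear model in one environment.

    Assumption: an environment is instantiated by rescaling the
    spurious feature of every example by `spurious_scale`.
    """
    w_core, w_spur = weights
    errors = [
        (w_core * core + w_spur * spurious_scale * spur - y) ** 2
        for core, spur, y in data
    ]
    return sum(errors) / len(errors)


def adaptive_rex_objective(weights, data, scales, env_logits, beta=1.0):
    """REx-style objective under a distribution over environments.

    `env_logits` parameterize the distribution over the transformations
    in `scales`; in an adaptive scheme they would be updated along with
    the model, by a rule this sketch does not specify.
    """
    q = softmax(env_logits)
    risks = [env_risk(weights, data, s) for s in scales]
    mean_risk = sum(p * r for p, r in zip(q, risks))
    var_risk = sum(p * (r - mean_risk) ** 2 for p, r in zip(q, risks))
    return mean_risk + beta * var_risk


# Toy data: (core feature, spurious feature, label).
data = [(1.0, 1.0, 1.0), (-1.0, -1.0, -1.0)]
scales = [1.0, -1.0]  # two environments: keep or flip the spurious feature

core_only = adaptive_rex_objective((1.0, 0.0), data, scales, [0.0, 0.0])
spurious_only = adaptive_rex_objective((0.0, 1.0), data, scales, [0.0, 0.0])
print(core_only, spurious_only)
```

In this toy setting, a model that relies only on the core feature has the same risk in every environment, whereas a model that relies on the spurious feature is penalized both through its mean risk and through the variance of its risks across environments. Swapping the variance penalty for a worst-case or feature-alignment term would give analogous sketches in the spirit of GroupDRO or CORAL.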