Machine learning systems struggle with robustness under subpopulation shifts. This problem becomes especially pronounced when only a subset of attribute combinations is observed during training, a severe form of subpopulation shift referred to as compositional shift. To address this problem, we ask the following question: can we improve robustness by training on synthetic data spanning all possible attribute combinations? We first show that training conditional diffusion models on limited data leads them to learn an incorrect underlying distribution. Consequently, sampling from such models yields unfaithful synthetic data that does not improve the performance of downstream machine learning systems. To address this problem, we propose CoInD, which reflects the compositional nature of the world by enforcing conditional independence through minimizing Fisher's divergence between the joint and marginal distributions. We demonstrate that the synthetic data generated by CoInD is faithful, and that this translates to state-of-the-art worst-group accuracy on compositional shift tasks on CelebA.
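To make the independence constraint concrete, here is a minimal sketch in score notation, assuming two attributes $c_1$ and $c_2$ that are independent given $x$ (and marginally); the notation is illustrative and the paper's exact objective may differ. Under these assumptions the joint conditional score decomposes into its marginal scores,
\[
\nabla_x \log p(x \mid c_1, c_2) = \nabla_x \log p(x \mid c_1) + \nabla_x \log p(x \mid c_2) - \nabla_x \log p(x),
\]
and a Fisher-divergence penalty measures the violation of this identity under the model $p_\theta$:
\[
\mathcal{L}_{\mathrm{CI}} = \mathbb{E}_{x}\!\left[ \left\| \nabla_x \log p_\theta(x \mid c_1, c_2) - \nabla_x \log p_\theta(x \mid c_1) - \nabla_x \log p_\theta(x \mid c_2) + \nabla_x \log p_\theta(x) \right\|_2^2 \right].
\]
In a diffusion model, each score term is available from the learned noise predictor, so such a penalty can in principle be added to the standard denoising objective.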