Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when training and test data come from the same distribution, but they often fail to generalize to out-of-distribution samples. In image classification, such models may rely on spurious correlations that often exist between labels and irrelevant image features, making predictions unreliable when those features are absent. We propose a technique that generates training samples with text-to-image (T2I) diffusion models to address the spurious correlation problem. First, we compute the token that best describes the visual features of the causal components of samples via a textual inversion mechanism. Then, leveraging a language-based segmentation method and a diffusion model, we generate new samples by combining the causal component with elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their composition matches our objective. Finally, we retrain the ERM model on the augmented dataset. This process reduces the model's reliance on spurious correlations by learning from carefully crafted samples in which these correlations do not exist. Our experiments show that, across different benchmarks, our technique achieves better worst-group accuracy than existing state-of-the-art methods.
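The pruning step described above can be sketched as a simple filter over generated samples. Everything in this sketch — the function interface, the threshold values, and the attribution-share criterion — is an illustrative assumption, not the paper's exact procedure: a generated sample is kept only if the ERM model predicts the intended label confidently and most of the attribution mass falls on the causal region.

```python
def prune_generated(probs, attributions, causal_mask, label,
                    prob_thresh=0.5, attr_thresh=0.6):
    """Decide whether to keep one generated sample.

    probs        -- ERM model class probabilities for the sample
    attributions -- per-region attribution scores (e.g. from a saliency method)
    causal_mask  -- True for regions belonging to the causal component
    label        -- the intended class of the generated sample

    Thresholds and the scoring rule are illustrative assumptions.
    """
    # 1) Prediction check: the ERM model should assign the intended
    #    label a sufficiently high probability.
    if probs[label] < prob_thresh:
        return False
    # 2) Attribution check: the share of absolute attribution falling
    #    on the causal regions should dominate.
    total = sum(abs(a) for a in attributions) or 1e-8
    causal = sum(abs(a) for a, m in zip(attributions, causal_mask) if m)
    return causal / total >= attr_thresh
```

Samples failing either check are discarded before retraining, which is what keeps the augmented dataset free of the unwanted label-feature correlation.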