From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

Visual recognition models are prone to learning spurious correlations induced by a biased training set in which certain conditions $B$ (\eg, Indoors) are over-represented in certain classes $Y$ (\eg, Big Dogs). Synthetic data from off-the-shelf large-scale generative models offers a promising way to mitigate this issue by augmenting underrepresented subgroups in the real dataset. However, training on a mixed distribution of real and synthetic data introduces another source of bias, caused by distributional differences between synthetic and real data (\eg, synthetic artifacts). As we will show, prior approaches that use synthetic data to resolve the model's bias toward $B$ do not correct the model's bias toward the pair $(B, G)$, where $G$ denotes whether the sample is real or synthetic. Thus, the model could simply learn signals based on the pair $(B, G)$ (\eg, Synthetic Indoors) to make predictions about $Y$ (\eg, Big Dogs). To address this issue, we propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real (FFR). In the first step, FFR pre-trains a model on balanced synthetic data to learn robust representations across subgroups. In the second step, FFR fine-tunes the model on real data using ERM or common loss-based bias mitigation methods. Because it trains on real and synthetic data separately, FFR does not expose the model to the statistical differences between the two and thus avoids the issue of bias toward the pair $(B, G)$. Our experiments show that FFR improves worst-group accuracy over the state of the art by up to 20\% across three datasets. Code available: \url{https://github.com/mqraitem/From-Fake-to-Real}
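The bias toward the pair $(B, G)$ can be illustrated with simple counts. The sketch below is not the paper's code: the subgroup sizes, the top-up rule used to stand in for mixed real and synthetic training, and the helper `p_y1` are all assumptions chosen for illustration. It compares $P(Y=1 \mid B)$ with $P(Y=1 \mid B, G)$ for a mixed dataset and for a balanced synthetic set such as the one FFR's first step could use.

```python
from collections import Counter


def p_y1(counts, b, g=None):
    """Return P(Y=1 | B=b) or, if g is given, P(Y=1 | B=b, G=g).

    `counts` maps (y, b, g) tuples to sample counts (hypothetical layout).
    """
    selected = {k: n for k, n in counts.items()
                if k[1] == b and (g is None or k[2] == g)}
    total = sum(selected.values())
    return sum(n for k, n in selected.items() if k[0] == 1) / total


# Hypothetical biased real dataset: B=1 is over-represented in class Y=1.
real = Counter({(0, 0, "real"): 90, (0, 1, "real"): 10,
                (1, 0, "real"): 10, (1, 1, "real"): 90})

# Assumed stand-in for mixed training: top up every subgroup with
# synthetic samples until it matches the largest real subgroup.
target = max(real.values())
synthetic_topup = Counter({(y, b, "syn"): target - n
                           for (y, b, _), n in real.items() if n < target})
mixed = real + synthetic_topup

# Assumed pre-training set for FFR step 1: synthetic only, equal subgroups.
balanced_synthetic = Counter({(y, b, "syn"): 50
                              for y in (0, 1) for b in (0, 1)})

print("mixed,  P(Y=1 | B=1)         =", p_y1(mixed, 1))
print("mixed,  P(Y=1 | B=1, G=real) =", p_y1(mixed, 1, "real"))
print("mixed,  P(Y=1 | B=1, G=syn)  =", p_y1(mixed, 1, "syn"))
print("synth., P(Y=1 | B=1)         =", p_y1(balanced_synthetic, 1))
```

In this toy setup the mixed dataset is balanced with respect to $B$ alone, yet $Y$ remains strongly predictable once $G$ is also known. In the balanced synthetic set, $G$ is constant and $B$ carries no information about $Y$, which is the property the first step of FFR relies on.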
