Better May Not Be Fairer: A Study on Subgroup Discrepancy in Image Classification

In this paper, we provide 20,000 non-trivial human annotations on popular datasets as a first step toward bridging the gap in studying how natural semantic spurious features affect image classification; prior works often study datasets that mix in low-level features because of limited access to realistic datasets. We investigate how natural background colors act as spurious features by annotating the test sets of CIFAR10 and CIFAR100 into subgroups based on the background color of each image. We name our datasets \textbf{CIFAR10-B} and \textbf{CIFAR100-B} and integrate them with the CIFAR-C datasets. We find that overall human-level accuracy does not guarantee consistent subgroup performance, and that the phenomenon persists even for models pre-trained on ImageNet or trained with data augmentation (DA). To alleviate this issue, we propose \textbf{FlowAug}, a \emph{semantic} DA method that leverages decoupled semantic representations captured by a pre-trained generative flow. Experimental results show that FlowAug achieves more consistent subgroup results than other types of DA methods on CIFAR10/100 and on CIFAR10/100-C, and that it also generalizes better. Furthermore, we propose a generic metric, \emph{MacroStd}, for studying model robustness to spurious correlations, which takes a macro average of the weighted standard deviations across different classes. We show that \emph{MacroStd} is more predictive of better performance; under this metric, FlowAug shows improvements in subgroup discrepancy. Although this metric is proposed to study our curated datasets, it applies to any dataset that has subgroups or subclasses. Lastly, we also show superior out-of-distribution results on CIFAR10.1.
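To make the MacroStd description concrete, the following is a minimal sketch of one possible reading of it, not the paper's exact definition. The abstract does not specify the weights, so the sketch assumes that, within each class, the standard deviation is taken over per-subgroup accuracies and that each subgroup is weighted by its share of that class's samples. The function name `macro_std` and its arguments are hypothetical.

```python
import math
from collections import defaultdict


def macro_std(labels, subgroups, correct):
    """Macro average over classes of a weighted std of subgroup accuracies.

    ASSUMPTION (not stated in the abstract): within a class, each subgroup
    is weighted by its fraction of that class's samples.
    labels[i]    -- class of sample i
    subgroups[i] -- subgroup (e.g. background color) of sample i
    correct[i]   -- truthy if the model classified sample i correctly
    """
    # counts[c][g] = [number correct, number of samples]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for c, g, ok in zip(labels, subgroups, correct):
        cell = counts[c][g]
        cell[0] += int(bool(ok))
        cell[1] += 1
    if not counts:
        raise ValueError("no samples given")

    per_class_std = []
    for groups in counts.values():
        n_c = sum(total for _, total in groups.values())
        # (accuracy, weight) for every subgroup of this class
        stats = [(hit / total, total / n_c) for hit, total in groups.values()]
        mean = sum(acc * w for acc, w in stats)
        var = sum(w * (acc - mean) ** 2 for acc, w in stats)
        per_class_std.append(math.sqrt(var))

    # Macro average: every class counts equally, regardless of its size.
    return sum(per_class_std) / len(per_class_std)


# Toy data: class 0 is perfect on subgroup "a" and fails on "b";
# class 1 is perfect on both subgroups.
labels = [0, 0, 0, 0, 1, 1]
subgroups = ["a", "a", "b", "b", "a", "b"]
correct = [1, 1, 0, 0, 1, 1]
print(macro_std(labels, subgroups, correct))  # 0.25
```

In the toy data, class 0 has subgroup accuracies 1.0 and 0.0 with equal weights, giving a weighted standard deviation of 0.5, while class 1 has no discrepancy; the macro average is therefore 0.25. Under this reading, a lower value indicates more consistent subgroup performance.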
