Models for fine-grained image classification, where differences between classes can be extremely subtle and the number of samples per class tends to be low, are particularly prone to picking up background-related biases and therefore need robust methods for handling examples with out-of-distribution (OOD) backgrounds. To better understand this problem, we investigate the impact of background-induced bias on fine-grained image classification, evaluating standard backbone models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). We explore two masking strategies to mitigate background-induced bias: early masking, which removes background information at the input image level, and late masking, which selectively masks high-level spatial features corresponding to the background. Extensive experiments assess the behavior of CNN and ViT models under different masking strategies, with a focus on their generalization to OOD backgrounds. The findings show that both strategies improve OOD performance over the baseline models, with early masking consistently achieving the best OOD performance. Notably, a ViT variant that classifies from global-average-pooled (GAP) patch tokens, combined with early masking, achieves the highest OOD robustness.
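The two masking strategies can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`early_mask`, `downsample_mask`, `late_mask_gap`), the assumption that a binary foreground mask is already available, the zero fill value, and the choice to realize late masking as an average over foreground cells only are all illustrative assumptions.

```python
import numpy as np


def early_mask(image, fg_mask, fill=0.0):
    """Early masking (sketch): replace background pixels of the input image.

    image:   (H, W, C) float array
    fg_mask: (H, W) boolean array, True on the object (assumed to be given)
    fill:    value written into background pixels (an assumption)
    """
    return np.where(fg_mask[..., None], image, fill)


def downsample_mask(fg_mask, h, w):
    """Reduce an (H, W) pixel mask to an (h, w) feature-grid mask.

    A grid cell counts as foreground if any pixel inside it is foreground.
    Assumes H and W are integer multiples of h and w.
    """
    H, W = fg_mask.shape
    blocks = fg_mask.reshape(h, H // h, w, W // w)
    return blocks.any(axis=(1, 3))


def late_mask_gap(features, fg_mask):
    """Late masking (sketch): average only the foreground feature cells.

    features: (h, w, D) CNN feature map, or ViT patch tokens laid out on
              their spatial grid
    returns:  (D,) pooled descriptor that a classifier head would consume
    """
    h, w, _ = features.shape
    cell_mask = downsample_mask(fg_mask, h, w)
    if not cell_mask.any():
        # No foreground cells: fall back to plain global average pooling.
        return features.mean(axis=(0, 1))
    return features[cell_mask].mean(axis=0)
```

In this sketch, early masking changes what the backbone sees, whereas late masking leaves the input untouched and only restricts which spatial features reach the pooled descriptor, so background content can still influence foreground features through the backbone's receptive field or attention.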