Prior work on fine-grained image recognition (FGIR) has established the importance of backbone selection but has neglected the accuracy-vs-cost trade-offs that arise under different training and evaluation settings. In this work, we conduct a large-scale study with over 2000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with a cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference cost by skipping the forward pass on discriminative crops that CAL and similar FGIR methods normally use. Our results show that applying data-aware augmentations only during training enables a model to achieve excellent accuracy even without crops at inference, significantly reducing inference cost. To support future research, we share our code and checkpoints at \url{https://github.com/arkel23/FGIR-Backbones}.