Self-supervised learning (SSL) strategies have demonstrated remarkable performance in various recognition tasks. However, both our preliminary investigation and recent studies suggest that they may be less effective at learning representations for fine-grained visual recognition (FGVR), since many features that help optimize SSL objectives are not suitable for characterizing the subtle differences in FGVR. To overcome this issue, we propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes, dubbed common rationales in this paper. Intuitively, common rationales tend to correspond to the discriminative patterns from the key parts of foreground objects. We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective, without using any pre-trained object part or saliency detectors, so it can be integrated seamlessly into the existing SSL process. Specifically, we train a branch with limited fitting capacity to fit the GradCAM, which allows the branch to capture the common rationales and discard the less common discriminative patterns. At the test stage, the branch generates a set of spatial weights to selectively aggregate the features representing an instance. Extensive experimental results on four visual tasks demonstrate that the proposed method yields significant improvements across different evaluation settings.
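The test-stage aggregation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `branch_logits` and `aggregate`, the use of a single linear projection per location as the "limited-capacity" branch, the softmax normalization of the weights, and the random toy feature map are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def branch_logits(features, proj):
    """Hypothetical low-capacity branch: one linear projection per location.

    features: array of shape (C, H, W), a backbone feature map.
    proj:     array of shape (C,), the branch's only parameters (assumed).
    Returns an (H, W) map of unnormalized spatial scores.
    """
    return np.tensordot(proj, features, axes=([0], [0]))


def aggregate(features, logits):
    """Pool a (C, H, W) feature map into a (C,) vector using spatial weights.

    The softmax over locations is an assumption; the abstract only says the
    branch produces spatial weights used to selectively aggregate features.
    """
    c, h, w = features.shape
    flat_logits = logits.reshape(h * w)
    flat_logits = flat_logits - flat_logits.max()  # numerical stability
    weights = np.exp(flat_logits)
    weights = weights / weights.sum()              # weights sum to 1
    return features.reshape(c, h * w) @ weights


# Toy data standing in for a real backbone output (assumed shapes).
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))
proj = rng.normal(size=8)
embedding = aggregate(features, branch_logits(features, proj))
```

With uniform scores the aggregation reduces to global average pooling, and with a sharply peaked score map it selects the feature vector at a single location, which is the sense in which the weights "selectively" aggregate features.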