Self-supervised learning (SSL) strategies have demonstrated remarkable performance in various recognition tasks. However, both our preliminary investigation and recent studies suggest that they may be less effective at learning representations for fine-grained visual recognition (FGVR), since many features that help optimize SSL objectives are not suitable for characterizing the subtle differences in FGVR. To overcome this issue, we propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes, which we call common rationales. Intuitively, common rationales tend to correspond to the discriminative patterns from the key parts of foreground objects. We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective, without using any pre-trained object part or saliency detectors, so that it can be seamlessly integrated with the existing SSL process. Specifically, we fit the GradCAM with a branch of limited fitting capacity, which allows the branch to capture the common rationales and discard the less common discriminative patterns. At the test stage, the branch generates a set of spatial weights to selectively aggregate the features representing an instance. Extensive experimental results on four visual tasks demonstrate that the proposed method can lead to a significant improvement in different evaluation settings.
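The two operations the abstract relies on, computing a GradCAM map and aggregating features with spatial weights, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `grad_cam` and `weighted_pool`, the flattened list layout, and the sum-to-one normalization of the weights are all assumptions made for illustration. `grad_cam` follows the standard GradCAM recipe (channel weights from spatially averaged gradients, followed by a ReLU). In the proposed method the test-time weights come from the learned limited-capacity branch, and the GradCAM map below only stands in for that output.

```python
from typing import List


def grad_cam(activations: List[List[float]],
             gradients: List[List[float]]) -> List[float]:
    """Standard Grad-CAM over flattened spatial positions.

    activations[c][p] and gradients[c][p] hold channel c at position p;
    the gradients are those of some objective w.r.t. the activations.
    """
    num_positions = len(activations[0])
    cam = [0.0] * num_positions
    for act_c, grad_c in zip(activations, gradients):
        # Channel weight: gradient averaged over all spatial positions.
        alpha = sum(grad_c) / len(grad_c)
        for p in range(num_positions):
            cam[p] += alpha * act_c[p]
    # ReLU keeps only positions that contribute positively.
    return [max(v, 0.0) for v in cam]


def weighted_pool(features: List[List[float]],
                  weights: List[float],
                  eps: float = 1e-8) -> List[float]:
    """Aggregate per-position features[p][c] into a single vector.

    Assumption: weights are non-negative and are normalized to sum to
    (approximately) one before the weighted sum is taken.
    """
    total = sum(weights) + eps
    dim = len(features[0])
    pooled = [0.0] * dim
    for feat_p, w in zip(features, weights):
        for c in range(dim):
            pooled[c] += (w / total) * feat_p[c]
    return pooled


# Toy example: 2 channels, 2 spatial positions.
activations = [[1.0, 2.0], [3.0, 0.0]]
gradients = [[1.0, 1.0], [-1.0, -1.0]]
cam = grad_cam(activations, gradients)

# Stand-in for the branch output: reuse the CAM as spatial weights.
features = [[1.0, 0.0], [0.0, 4.0]]  # features[p][c]
pooled = weighted_pool(features, cam)
```

In this toy case only the second position receives a non-zero weight, so the pooled vector is dominated by that position's features. Uniform weights would reduce `weighted_pool` to ordinary average pooling.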