Fine-grained image retrieval (FGIR) aims to learn visual representations that distinguish visually similar objects while maintaining generalization. Existing methods focus on generating discriminative features but rarely consider the particularities of the FGIR task itself. This paper presents a meticulous analysis that leads to practical guidelines for identifying subcategory-specific discrepancies and generating discriminative features when designing effective FGIR models. These guidelines include emphasizing the object (G1), highlighting subcategory-specific discrepancies (G2), and employing an effective training strategy (G3). Following G1 and G2, we design a novel Dual Visual Filtering mechanism for the plain vision transformer, denoted as DVF, to capture subcategory-specific discrepancies. Specifically, the dual visual filtering mechanism comprises an object-oriented module and a semantic-oriented module, which magnify the object and identify discriminative regions, respectively. Following G3, we adopt a discriminative model training strategy to improve the discriminability and generalization ability of DVF. Extensive analysis and ablation studies confirm the efficacy of the proposed guidelines. Without bells and whistles, the proposed DVF achieves state-of-the-art performance on three widely used fine-grained datasets under both closed-set and open-set settings.
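To make the dual filtering idea concrete, the following is a minimal, hypothetical sketch of how an object-oriented module (magnifying the object, per G1) and a semantic-oriented module (selecting discriminative regions, per G2) could operate on a ViT's [CLS]-to-patch attention. Function names, thresholds, and the token-selection rule are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch: object magnification + discriminative token selection on ViT outputs.
import torch
import torch.nn.functional as F


def object_oriented_filter(image, cls_attn, keep_ratio=0.5):
    """Crop-and-zoom the image around high-attention patches (object emphasis, G1).

    image:    (B, 3, H, W) input batch
    cls_attn: (B, N) attention of the [CLS] token over N = (H/p) * (W/p) patches
    """
    B, _, H, W = image.shape
    n = int(cls_attn.shape[1] ** 0.5)  # patches per side
    attn_map = cls_attn.view(B, 1, n, n)
    attn_map = F.interpolate(attn_map, size=(H, W), mode="bilinear", align_corners=False)
    crops = []
    for b in range(B):
        # Keep the most attended pixels, then crop their bounding box and zoom it back up.
        thresh = attn_map[b, 0].flatten().quantile(1 - keep_ratio)
        ys, xs = torch.nonzero(attn_map[b, 0] >= thresh, as_tuple=True)
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        crop = image[b:b + 1, :, y0:y1, x0:x1]
        crops.append(F.interpolate(crop, size=(H, W), mode="bilinear", align_corners=False))
    return torch.cat(crops, dim=0)  # magnified object view


def semantic_oriented_filter(tokens, cls_attn, top_k=32):
    """Keep the top-k most attended patch tokens (discriminative regions, G2)."""
    idx = cls_attn.topk(top_k, dim=1).indices  # (B, top_k)
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))


if __name__ == "__main__":
    # Toy usage with random tensors standing in for a ViT's attention and patch tokens.
    img = torch.randn(2, 3, 224, 224)
    attn = torch.rand(2, 196).softmax(dim=1)
    patch_tokens = torch.randn(2, 196, 768)
    zoomed = object_oriented_filter(img, attn)               # (2, 3, 224, 224)
    selected = semantic_oriented_filter(patch_tokens, attn)  # (2, 32, 768)
    print(zoomed.shape, selected.shape)
```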