Fine-grained visual classification (FGVC) is a challenging computer vision problem in which the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative inter-class variances among visually similar classes. Recently, methods based on the Vision Transformer (ViT) have achieved notable results in FGVC, generally by combining the self-attention mechanism with additional resource-consuming techniques to distinguish potentially discriminative regions while disregarding the rest. However, because such approaches rely only on the inherent self-attention mechanism, they may struggle to focus on truly discriminative regions, so the classification token is likely to aggregate global information from less important background patches. Moreover, because training data are scarce, classifiers may fail to find the most helpful inter-class distinguishing features, since unrelated but distinctive background regions may be falsely recognised as valuable. To this end, we introduce a simple yet effective Salient Mask-Guided Vision Transformer (SM-ViT), in which the discriminability of the standard ViT's attention maps is boosted through salient masking of potentially discriminative foreground regions. Extensive experiments demonstrate that, with the standard training procedure, our SM-ViT achieves state-of-the-art performance on popular FGVC benchmarks among existing ViT-based approaches while requiring fewer resources and a lower input image resolution.
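As a rough illustration of the general idea of mask-guided attention, and not the authors' implementation, the following sketch assumes that a binary saliency mask over image patches is already available and that guidance is applied by adding a constant to the classification token's pre-softmax attention scores for salient patches. The function name `boost_cls_attention`, the `boost` hyperparameter, and the choice to modify only the classification-token row are all assumptions made for this example; the abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def boost_cls_attention(scores, salient_mask, boost=1.0):
    """Raise the CLS token's attention towards salient patches.

    scores:       (heads, N + 1, N + 1) pre-softmax attention scores,
                  where index 0 is the classification (CLS) token.
    salient_mask: (N,) boolean array, True for foreground patches.
    boost:        hypothetical additive constant (an assumption here).
    Returns the attention weights after softmax.
    """
    scores = np.array(scores, dtype=float)      # work on a copy
    mask = np.asarray(salient_mask, dtype=bool)
    # Row 0 holds the CLS queries; columns 1..N are the image patches.
    scores[:, 0, 1:] += boost * mask
    return softmax(scores, axis=-1)


# Toy example: one head, four patches, uniform initial scores.
toy_scores = np.zeros((1, 5, 5))
toy_mask = np.array([True, False, False, True])
attn = boost_cls_attention(toy_scores, toy_mask, boost=1.0)
print(attn[0, 0])  # CLS attention over [CLS, p1, p2, p3, p4]
```

In this toy setting the classification token places more weight on the two patches marked as salient, while the attention rows of all other tokens are left unchanged.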