Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features with a pre-trained network backbone (i.e., a CNN or ViT), which fails to learn matched visual-semantic correspondences for representing semantic-related visual features due to the lack of semantic guidance, resulting in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT considers two properties throughout the whole network: i) discovering semantic-related visual representations explicitly, and ii) discarding semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement, and to discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse the visual tokens with low visual-semantic correspondence to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that our ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT .
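The two operations above can be sketched in miniature. The following is a hedged, dependency-free illustration (not the authors' implementation, whose details live in the linked repository): each visual token is scored by its similarity to a semantic attribute vector, the top-scoring semantic-related tokens are kept, and the remaining low-correspondence tokens are fused into a single averaged token rather than discarded outright. The function name `select_and_fuse` and the toy vectors are hypothetical.

```python
# Hedged sketch of semantic-guided token selection and fusion.
# Tokens and the semantic vector are plain Python lists of floats;
# a real ViT would operate on learned embeddings in a tensor library.

def dot(u, v):
    """Inner product, used here as the visual-semantic correspondence score."""
    return sum(a * b for a, b in zip(u, v))

def select_and_fuse(tokens, semantic, keep):
    """Keep the `keep` tokens most aligned with `semantic`;
    fuse the low-correspondence rest into one mean token."""
    scores = [dot(t, semantic) for t in tokens]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = [tokens[i] for i in order[:keep]]
    rest = [tokens[i] for i in order[keep:]]
    if rest:
        # Average the semantic-unrelated tokens coordinate-wise so their
        # information is compressed into a single fused token.
        fused = [sum(coord) / len(rest) for coord in zip(*rest)]
        kept.append(fused)
    return kept

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
semantic = [1.0, 0.0]  # toy attribute vector
out = select_and_fuse(tokens, semantic, keep=2)
print(len(out))  # 2 kept tokens + 1 fused token = 3
```

Applying such a block inside successive encoder layers, with `keep` shrinking layer by layer, is one way to realize the "progressive" reduction the abstract describes: semantic-related tokens survive, while semantic-unrelated ones are merged away.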