Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features with a pre-trained network backbone (i.e., a CNN or ViT), which fails to learn matched visual-semantic correspondences for representing semantic-related visual features due to the lack of semantic guidance, resulting in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT considers two properties throughout the whole network: i) discovering semantic-related visual representations explicitly, and ii) discarding semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement, and to discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse the visual tokens with low visual-semantic correspondence to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that our ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT .
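The two operations above can be sketched in miniature. The following is a hedged, dependency-free illustration (not the authors' implementation, whose details live in the linked repository): each visual token is scored by its similarity to a semantic attribute vector, the top-scoring semantic-related tokens are kept, and the remaining low-correspondence tokens are fused into a single averaged token rather than discarded outright. The function name `select_and_fuse` and the toy vectors are hypothetical.

```python
# Hedged sketch of semantic-guided token selection and fusion.
# Tokens and the semantic vector are plain Python lists of floats;
# a real ViT would operate on learned embeddings in a tensor library.

def dot(u, v):
    """Inner product, used here as the visual-semantic correspondence score."""
    return sum(a * b for a, b in zip(u, v))

def select_and_fuse(tokens, semantic, keep):
    """Keep the `keep` tokens most aligned with `semantic`;
    fuse the low-correspondence rest into one mean token."""
    scores = [dot(t, semantic) for t in tokens]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = [tokens[i] for i in order[:keep]]
    rest = [tokens[i] for i in order[keep:]]
    if rest:
        # Average the semantic-unrelated tokens coordinate-wise so their
        # information is compressed into a single fused token.
        fused = [sum(coord) / len(rest) for coord in zip(*rest)]
        kept.append(fused)
    return kept

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
semantic = [1.0, 0.0]  # toy attribute vector
out = select_and_fuse(tokens, semantic, keep=2)
print(len(out))  # 2 kept tokens + 1 fused token = 3
```

Applying such a block inside successive encoder layers, with `keep` shrinking layer by layer, is one way to realize the "progressive" reduction the abstract describes: semantic-related tokens survive, while semantic-unrelated ones are merged away.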