Generalized zero-shot learning (GZSL) is a technique for training a deep learning model to identify unseen classes using image attributes. In this paper, we put forth a new GZSL approach that exploits the Vision Transformer (ViT) to maximize the attribute-related information contained in the image feature. In ViT, the entire image region is processed without degrading the image resolution, and local image information is preserved in the patch features. To take full advantage of these benefits, we use the patch features as well as the CLS feature when extracting the attribute-related image feature. In particular, we propose a novel attention-based module, called the attribute attention module (AAM), to aggregate the attribute-related information in the patch features. In AAM, the correlation between each patch feature and the synthetic image attribute is used as the importance weight for that patch. Through extensive experiments on benchmark datasets, we demonstrate that the proposed technique outperforms state-of-the-art GZSL approaches by a large margin.
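The patch-weighting idea behind AAM can be illustrated with a minimal sketch. It is not the module evaluated in this paper: it assumes that the correlation is a scaled dot product, that the importance weights are normalized with a softmax over patches, and that the attribute vector has already been projected to the same dimension as the patch features. The function name `attribute_attention` and the toy values are hypothetical.

```python
import math


def attribute_attention(patch_features, attribute_query):
    """Pool patch features, weighting each patch by its similarity to an attribute vector.

    Assumptions (not taken from the paper): scaled dot-product similarity,
    softmax normalization, and a shared dimension for patches and attribute.
    """
    dim = len(attribute_query)

    # Scaled dot product stands in for the "correlation" between patch and attribute.
    scores = [
        sum(p * a for p, a in zip(patch, attribute_query)) / math.sqrt(dim)
        for patch in patch_features
    ]

    # Softmax over patches; subtracting the maximum keeps exp() numerically stable.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]

    # Importance-weighted sum of the patch features.
    aggregated = [
        sum(w * patch[d] for w, patch in zip(weights, patch_features))
        for d in range(dim)
    ]
    return weights, aggregated


# Toy input: three 2-dimensional patch features and one attribute vector.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attribute = [2.0, 0.0]
weights, pooled = attribute_attention(patches, attribute)
```

In this toy input, the first patch is the most aligned with the attribute vector, so it receives the largest weight and dominates the pooled feature. A learned module would typically add trainable projections for the patches and the attribute, which are omitted here.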