Zero-shot learning (ZSL) aims to recognize novel classes by transferring shared semantic knowledge (e.g., attributes) from seen classes to unseen classes. Recently, attention-based methods, which align visual features with attributes via a spatial attention mechanism, have made significant progress. However, these methods explore the visual-semantic relationship only in the spatial dimension, which can lead to classification ambiguity when different attributes share similar attention regions, and they rarely consider the semantic relationships among attributes. To alleviate these problems, we propose a Dual Relation Mining Network (DRMN) that enables more effective visual-semantic interactions and learns semantic relationships among attributes for knowledge transfer. Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information through multi-level feature fusion and applies spatial attention to embed visual features into the semantic space. Moreover, attribute-guided channel attention is used to decouple entangled semantic features. For semantic relationship modeling, we use a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations across images. Additionally, we introduce a global classification branch to complement the human-defined semantic attributes and combine its predictions with those of attribute-based classification. Extensive experiments demonstrate that DRMN achieves new state-of-the-art performance on three standard ZSL benchmarks: CUB, SUN, and AwA2.
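To make the two attention dimensions concrete, the following is a minimal illustrative sketch and not the DRMN implementation. It assumes that image regions and attributes are already embedded as vectors of the same dimension. All function names (`spatial_attention`, `channel_gate`, `attribute_scores`, `classify`) and the sigmoid gating rule are hypothetical choices made for illustration. Multi-level feature fusion, the SIT, and the global classification branch are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spatial_attention(regions, attr_vecs):
    """For each attribute vector, attend over image regions.

    regions:   list of R region features, each of dimension C
    attr_vecs: list of A attribute embeddings, each of dimension C
    Returns (feats, maps): A attended features and A attention maps over R.
    """
    feats, maps = [], []
    for a in attr_vecs:
        weights = softmax([dot(a, r) for r in regions])
        feat = [sum(w * r[c] for w, r in zip(weights, regions))
                for c in range(len(a))]
        feats.append(feat)
        maps.append(weights)
    return feats, maps


def channel_gate(feats, attr_vecs):
    """Hypothetical attribute-guided channel attention: each attribute
    produces a sigmoid gate that re-weights the channels of its feature."""
    gated = []
    for f, a in zip(feats, attr_vecs):
        gate = [1.0 / (1.0 + math.exp(-x)) for x in a]
        gated.append([g * x for g, x in zip(gate, f)])
    return gated


def attribute_scores(feats, attr_vecs):
    """Project each attribute-specific feature onto its attribute vector."""
    return [dot(f, a) for f, a in zip(feats, attr_vecs)]


def classify(scores, class_attrs):
    """Pick the class whose attribute signature best matches the scores."""
    return max(class_attrs, key=lambda name: dot(scores, class_attrs[name]))


if __name__ == "__main__":
    regions = [[3.0, 0.0], [0.0, 1.0]]      # two regions, two channels
    attr_vecs = [[1.0, 0.0], [0.0, 1.0]]    # two attribute embeddings
    class_attrs = {"unseen_a": [1.0, 0.0], "unseen_b": [0.0, 1.0]}

    feats, maps = spatial_attention(regions, attr_vecs)
    scores = attribute_scores(channel_gate(feats, attr_vecs), attr_vecs)
    print(classify(scores, class_attrs))
```

In this sketch, unseen classes are handled only through their attribute signatures in `class_attrs`, so no training images of those classes are needed at prediction time. That reliance on attribute signatures is the knowledge-transfer idea the abstract describes.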