This paper investigates the challenging problem of multi-label zero-shot learning (MLZSL), in which a model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge such as semantic information. Existing methods usually analyze the relationships among the seen classes present in a sample along spatial or semantic dimensions, and then transfer the learned model to unseen classes. However, they neglect the effective integration of local and global features. When inferring unseen classes, global features represent the principal direction of the image in the feature space, while local features should remain distinct within a certain range. Neglecting this integration causes the model to lose its grasp of the main components of the image, and relying only on the local presence of seen classes at inference time introduces unavoidable bias. In this paper, we propose a novel and effective group bi-enhancement framework for MLZSL, dubbed GBE-MLZSL, which fully exploits these properties to enable a more accurate and robust visual-semantic projection. Specifically, we split the feature maps into several feature groups, each of which can be trained independently with the Local Information Distinguishing Module (LID) to ensure uniqueness. Meanwhile, a Global Enhancement Module (GEM) is designed to preserve the principal direction. In addition, a static graph structure is designed to model the correlations among local features. Experiments on the large-scale MLZSL benchmark datasets NUS-WIDE and Open-Images-v4 demonstrate that the proposed GBE-MLZSL outperforms other state-of-the-art methods by large margins.
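To make the idea of splitting feature maps into groups alongside a global view more concrete, the following is a minimal sketch, not the authors' LID or GEM modules. It assumes that the split is made along the channel axis into equally sized contiguous groups, and that both the local and the global descriptors are obtained by simple spatial averaging; the abstract specifies neither choice. The names `split_into_groups` and `average_pool` are hypothetical.

```python
from typing import List

# A feature map is stored as channels x spatial positions (H*W flattened).
FeatureMap = List[List[float]]


def split_into_groups(feature_map: FeatureMap, num_groups: int) -> List[FeatureMap]:
    """Split the channel axis into equally sized, contiguous groups.

    Assumption: grouping is channel-wise; the source does not state the axis.
    """
    channels = len(feature_map)
    if num_groups <= 0 or channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    size = channels // num_groups
    return [feature_map[i * size:(i + 1) * size] for i in range(num_groups)]


def average_pool(feature_map: FeatureMap) -> List[float]:
    """Average each channel over its spatial positions."""
    return [sum(channel) / len(channel) for channel in feature_map]


# Toy feature map: 4 channels, a 2x2 spatial grid flattened to 4 positions.
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
    [4.0, 0.0, 4.0, 0.0],
]

groups = split_into_groups(toy, num_groups=2)

# One descriptor per group (a stand-in for the local, per-group branch).
local_descriptors = [average_pool(group) for group in groups]

# One descriptor over all channels (a stand-in for the global branch).
global_descriptor = average_pool(toy)

print(local_descriptors)
print(global_descriptor)
```

In this sketch each group yields its own descriptor, which could then be processed independently, while the global descriptor summarizes the whole map. How the two are actually combined in GBE-MLZSL is not described in the abstract and is left out here.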