This paper investigates the challenging problem of zero-shot learning in the multi-label scenario (MLZSL), in which a model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge such as semantic information. Existing methods typically analyze the relationships among the seen classes in a sample along the spatial or semantic dimension and transfer the learned model to unseen classes. However, they neglect the integrity of local and global features. Although attention structures can accurately localize local features, especially objects, they significantly compromise feature integrity and distort the relationships between classes. Coarse processing of global features likewise undermines comprehensiveness. As a result, the model loses its grasp of the main components of the image, and relying only on the local presence of seen classes at inference introduces unavoidable bias. In this paper, we propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, that fully exploits these properties to enable a more accurate and robust visual-semantic projection. For spatial information, we achieve effective refinement by group-aggregating image features into several semantic prompts; this aggregates semantic rather than class information, preserving the correlations among semantics. For global semantics, we use global forward propagation to collect as much information as possible, ensuring that no semantics are omitted. Experiments on the large-scale MLZSL benchmark datasets NUS-Wide and Open-Images-v4 demonstrate that the proposed Epsilon outperforms other state-of-the-art methods by large margins.