The aim of multi-label few-shot image classification (ML-FSIC) is to assign semantic labels to images in settings where only a small number of training examples are available for each label. A key feature of the multi-label setting is that images often have several labels, which typically refer to objects appearing in different regions of the image. When estimating label prototypes in a metric-based setting, it is thus important to determine which regions are relevant for which labels, but the limited amount of training data and the noisy nature of local features make this highly challenging. As a solution, we propose a strategy in which label prototypes are gradually refined. First, we initialize the prototypes using word embeddings, which allows us to leverage prior knowledge about the meaning of the labels. Second, taking advantage of these initial prototypes, we use a Loss Change Measurement~(LCM) strategy to select the local features from the training images (i.e.\ the support set) that are most likely to be representative of a given label. Third, we construct the final prototype of the label by aggregating these representative local features using a multi-modal cross-interaction mechanism, which again relies on the initial word embedding-based prototypes. Experiments on COCO, PASCAL VOC, NUS-WIDE, and iMaterialist show that our model substantially improves over the current state of the art.
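The three-step refinement described above can be illustrated with a minimal sketch. It is not the method itself: cosine similarity to the initial prototype stands in for the LCM selection criterion, and a softmax-weighted average stands in for the multi-modal cross-interaction mechanism, since neither is specified at this level of detail. The sketch also assumes the word embedding has already been projected into the same space as the local visual features. All function names and the parameters `k` and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def select_local_features(init_proto, local_feats, k):
    """Step 2 (stand-in for LCM): keep the k local features most similar
    to the initial, word-embedding-based prototype."""
    ranked = sorted(local_feats, key=lambda f: cosine(init_proto, f), reverse=True)
    return ranked[:k]


def aggregate(init_proto, selected, temperature=1.0):
    """Step 3 (stand-in for cross-interaction): softmax-weighted average of the
    selected features, with weights driven by similarity to the initial prototype."""
    if not selected:
        raise ValueError("at least one local feature must be selected")
    scores = [cosine(init_proto, f) / temperature for f in selected]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(init_proto)
    return [sum(w * f[d] for w, f in zip(weights, selected)) for d in range(dim)]


def refine_prototype(word_embedding, local_feats, k=2):
    """Step 1: initialize from the word embedding, then select and aggregate."""
    init_proto = list(word_embedding)  # assumed to live in the visual feature space
    selected = select_local_features(init_proto, local_feats, k)
    return aggregate(init_proto, selected)


# Toy support set: four 2-D local features for one label.
word_embedding = [1.0, 0.0]
local_feats = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
prototype = refine_prototype(word_embedding, local_feats, k=2)
```

In this toy example the two features pointing roughly along the word embedding are kept, and the resulting prototype is a convex combination of them, so background-like features such as `[0.0, 1.0]` do not contribute.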