Despite impressive progress, current multi-label image recognition (MLR) algorithms depend heavily on large-scale datasets with complete labels, which makes collecting such datasets extremely time-consuming and labor-intensive. Training multi-label image recognition models with partial labels (MLR-PL) is an alternative, in which only some labels are known for each image while the others are unknown. However, current MLR-PL algorithms rely on pre-trained image similarity models or on iteratively updating the image classification models to generate pseudo labels for the unknown labels. Thus, they depend on a certain amount of annotations and inevitably suffer from obvious performance drops, especially when the known label proportion is low. To address this dilemma, we propose a dual-perspective semantic-aware representation blending (DSRB) framework that blends multi-granularity category-specific semantic representations across different images, from the instance and prototype perspectives respectively, to transfer information from known labels to complement unknown labels. Specifically, an instance-perspective representation blending (IPRB) module is designed to blend the representations of the known labels in one image with the representations of the corresponding unknown labels in another image to complement these unknown labels. Meanwhile, a prototype-perspective representation blending (PPRB) module is introduced to learn more stable representation prototypes for each category and to blend the representations of unknown labels with the prototypes of the corresponding labels, in a location-sensitive manner, to complement these unknown labels. Extensive experiments on the MS-COCO, Visual Genome, and Pascal VOC 2007 datasets show that the proposed DSRB consistently outperforms current state-of-the-art algorithms under all known label proportion settings.
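To make the two blending perspectives concrete, the following is a minimal sketch and not the implementation evaluated in the experiments. It assumes that each image is already described by one feature vector per category, that labels are encoded as 1 (known positive), -1 (known negative), and 0 (unknown), and that blending is a convex combination with a fixed ratio `alpha`. The function names, the label encoding, the fixed `alpha`, and the use of a simple per-category mean as the prototype are all illustrative assumptions; in particular, the mean stands in for the learned prototypes, and the location-sensitive aspect of PPRB is omitted.

```python
# Assumed label convention: 1 = known positive, -1 = known negative, 0 = unknown.
# All names below are hypothetical and chosen for illustration only.


def blend(u, v, alpha):
    """Convex combination alpha * u + (1 - alpha) * v of two vectors."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(u, v)]


def instance_blend(feats_src, labels_src, feats_dst, labels_dst, alpha=0.5):
    """Instance perspective: for every category that is a known positive in the
    source image and unknown in the destination image, blend the source
    representation into the destination one and mark the label as positive."""
    new_feats = [list(f) for f in feats_dst]  # copy so the inputs stay untouched
    new_labels = list(labels_dst)
    for c, (ls, ld) in enumerate(zip(labels_src, labels_dst)):
        if ls == 1 and ld == 0:
            new_feats[c] = blend(feats_src[c], feats_dst[c], alpha)
            new_labels[c] = 1
    return new_feats, new_labels


def mean_prototypes(batch_feats, batch_labels):
    """Simplified prototypes: per-category mean of the known-positive
    representations in a batch (None when a category has no known positive)."""
    num_classes = len(batch_labels[0])
    protos = []
    for c in range(num_classes):
        pos = [f[c] for f, l in zip(batch_feats, batch_labels) if l[c] == 1]
        if pos:
            protos.append([sum(col) / len(pos) for col in zip(*pos)])
        else:
            protos.append(None)
    return protos


def prototype_blend(feats, labels, protos, alpha=0.5):
    """Prototype perspective: blend the prototype of each unknown category
    into that category's representation and mark the label as positive."""
    new_feats = [list(f) for f in feats]
    new_labels = list(labels)
    for c, l in enumerate(labels):
        if l == 0 and protos[c] is not None:
            new_feats[c] = blend(protos[c], feats[c], alpha)
            new_labels[c] = 1
    return new_feats, new_labels
```

In this sketch both routines leave known labels untouched and only modify categories whose label is unknown, which mirrors the stated goal of transferring information from known labels to complement unknown ones.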