Long-tailed multi-label visual recognition (LTML) is highly challenging due to label co-occurrence and imbalanced data distribution. In this work, we propose a unified framework for LTML, namely prompt tuning with class-specific embedding loss (LMPT), which captures the semantic feature interactions between categories by combining text and image modality data and improves performance on head and tail classes simultaneously. Specifically, LMPT introduces an embedding loss function with class-aware soft margin and re-weighting to learn class-specific contexts with the benefit of textual descriptions (captions), which helps establish semantic relationships between classes, especially between head and tail classes. Furthermore, to account for class imbalance, the distribution-balanced loss is adopted as the classification loss function to further improve performance on tail classes without compromising head classes. Extensive experiments on the VOC-LT and COCO-LT datasets demonstrate that the proposed method significantly surpasses previous state-of-the-art methods and zero-shot CLIP on LTML. Our code is available at \url{https://github.com/richard-peng-xia/LMPT}.
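To make the idea of a class-aware soft margin with re-weighting concrete, the following is a minimal sketch and not the loss defined in the paper. Everything in it is an assumption made for illustration: the function names, the margin schedule proportional to $n^{-1/4}$ (borrowed from margin-based long-tailed learning), the inverse-frequency weights, and the softplus form of the soft margin. The inputs are assumed to be cosine similarities between one image (or caption) embedding and each class prompt embedding.

```python
import math

def class_margins(class_counts, max_margin=0.5):
    """Hypothetical schedule: margin proportional to n^(-1/4),
    scaled so that the rarest class receives max_margin."""
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [r * scale for r in raw]

def class_weights(class_counts):
    """Hypothetical re-weighting: inverse class frequency, normalized to mean 1."""
    inv = [1.0 / n for n in class_counts]
    mean = sum(inv) / len(inv)
    return [w / mean for w in inv]

def soft_margin_embedding_loss(similarities, labels, class_counts):
    """Weighted soft-margin loss over all classes for a single sample.

    similarities: cosine similarity to each class embedding, in [-1, 1]
    labels:       1 if the class is present in the sample, else 0
    class_counts: number of training samples per class
    """
    margins = class_margins(class_counts)
    weights = class_weights(class_counts)
    total = 0.0
    for s, y, m, w in zip(similarities, labels, margins, weights):
        sign = 1.0 if y == 1 else -1.0
        # softplus(m - sign * s): a smooth hinge that pushes positives
        # above the class margin and negatives below its negation
        total += w * math.log1p(math.exp(m - sign * s))
    return total / len(labels)

# Toy example: one head class (1000 samples) and one tail class (10 samples).
counts = [1000, 10]
loss = soft_margin_embedding_loss([0.9, 0.2], [1, 1], counts)
```

Under these assumptions, the tail class receives both a larger margin and a larger weight, so a poorly matched tail-class embedding contributes more to the loss than an equally mismatched head-class embedding.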