Identifying labels that did not appear during training, known as multi-label zero-shot learning, is a non-trivial task in computer vision. To this end, recent studies have attempted to exploit the multi-modal knowledge of vision-language pre-training (VLP) models through knowledge distillation, allowing unseen labels to be recognized in an open-vocabulary manner. However, experimental evidence shows that knowledge distillation is suboptimal and provides only limited performance gains in unseen label prediction. In this paper, a novel query-based knowledge sharing paradigm is proposed to exploit the multi-modal knowledge of the pre-trained VLP model for open-vocabulary multi-label classification. Specifically, a set of learnable label-agnostic query tokens is trained to extract critical vision knowledge from the input image. These tokens are then shared across all labels, allowing each label to select tokens of interest as visual clues for recognition. In addition, we propose an effective prompt pool for robust label embedding, and we reformulate standard ranking learning as a form of classification so that the magnitude of the feature vectors can contribute to matching; both changes significantly benefit label recognition. Experimental results show that our framework significantly outperforms state-of-the-art methods on the zero-shot task, by 5.9% and 4.5% in mAP on NUS-WIDE and Open Images, respectively.
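The query-based sharing step described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the single-head attention, the dimensions, and the random stand-ins for patch features, query tokens, and label embeddings are all assumptions made for illustration. In a real system the patch features and label embeddings would presumably come from the VLP image and text encoders, and the query tokens would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extract_shared_tokens(queries, patches):
    """Each label-agnostic query pools the image patches (single-head cross-attention)."""
    d = queries.shape[-1]
    attn = softmax(queries @ patches.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ patches                                       # (K, d)

def label_scores(label_emb, tokens):
    """Each label weights the shared tokens, then is matched by an unnormalized dot product."""
    d = label_emb.shape[-1]
    select = softmax(label_emb @ tokens.T / np.sqrt(d), axis=-1)  # (C, K)
    clues = select @ tokens                                        # (C, d)
    # The raw dot product (not cosine similarity) keeps vector magnitude in the score;
    # the sigmoid turns each label into an independent binary classification.
    logits = np.sum(clues * label_emb, axis=-1)                    # (C,)
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical sizes: 49 patches, 8 query tokens, 5 labels, feature dimension 16.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))    # stand-in for image patch features
queries = rng.normal(size=(8, 16))     # stand-in for learned query tokens
label_emb = rng.normal(size=(5, 16))   # stand-in for text-encoder label embeddings

tokens = extract_shared_tokens(queries, patches)
probs = label_scores(label_emb, tokens)
```

In this sketch the shared tokens are computed once per image, independently of the label set, so scoring an unseen label only requires adding a row to `label_emb`.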