Real-world recognition systems often encounter unseen labels. To identify such labels, multi-label zero-shot learning (ML-ZSL) transfers knowledge through pre-trained textual label embeddings (e.g., GloVe). However, such methods exploit only single-modal knowledge from a language model and ignore the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) methods successfully exploit this image-text information in object detection and achieve impressive performance. Inspired by their success, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits the multi-modal knowledge of image-text pairs encoded in a vision and language pre-training (VLP) model. To transfer the image-text matching ability of the VLP model, knowledge distillation is employed to ensure the consistency of image and label embeddings, and prompt tuning is used to further update the label embeddings. To enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. The source code is available at https://github.com/sunanhe/MKT.
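To make the two-stream idea and the distillation objective more concrete, the following is a minimal sketch and not the released implementation. It assumes that labels are scored by cosine similarity between image features and label embeddings, that the local stream averages the top-k patch scores, that the two streams are averaged with equal weight, and that distillation uses a mean absolute difference between student and teacher image embeddings. The function names `two_stream_scores` and `distillation_loss`, the `top_k` parameter, and all of these design choices are illustrative assumptions; the abstract does not specify them.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so that dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def two_stream_scores(global_feat, patch_feats, label_embs, top_k=2):
    """Score each label from one global feature and several local patch features.

    Assumed design (hypothetical): the global stream is the cosine similarity
    of the global feature with the label embedding, the local stream is the
    mean of the top-k patch similarities, and the two are averaged.
    """
    g = normalize(global_feat)
    patches = [normalize(p) for p in patch_feats]
    scores = []
    for label in label_embs:
        t = normalize(label)
        global_score = dot(g, t)
        patch_scores = sorted((dot(p, t) for p in patches), reverse=True)
        k = min(top_k, len(patch_scores))
        local_score = sum(patch_scores[:k]) / k
        scores.append(0.5 * (global_score + local_score))
    return scores


def distillation_loss(student_feat, teacher_feat):
    """Mean absolute difference between student and teacher image embeddings.

    Stands in for the consistency constraint; the exact loss is an assumption.
    """
    diffs = [abs(s - t) for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


# Toy example with two labels in a 2-D embedding space (made-up values).
label_embs = [[1.0, 0.0], [0.0, 1.0]]
global_feat = [1.0, 0.0]
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = two_stream_scores(global_feat, patch_feats, label_embs, top_k=2)
```

In this toy example the second label has zero global similarity, yet it still receives a nonzero score because one patch matches it. This is the kind of behavior a local stream is meant to provide when an image contains multiple objects.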