Multi-label classification is an essential task in a wide variety of real-world applications. Multi-label zero-shot learning classifies images into multiple unseen categories for which no training data is available; in the generalized zero-shot setting, by contrast, the test set may also include seen classes. We propose CLIP-Decoder, a novel method built on the state-of-the-art ML-Decoder attention-based classification head. CLIP-Decoder introduces multi-modal representation learning, using a text encoder to extract text features and an image encoder to extract image features. Furthermore, we minimize semantic mismatch by projecting image and text embeddings into a common dimension and comparing their representations with a combined loss comprising a classification loss and a CLIP loss. With this strategy, CLIP-Decoder achieves state-of-the-art results on zero-shot multi-label classification: an absolute gain of 3.9% over existing methods on the zero-shot multi-label classification task, and a gain of almost 2.3% on the generalized zero-shot multi-label classification task.
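The combined objective described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the weighting factor `alpha`, the temperature, and the exact form of the two terms are assumptions; the sketch pairs a multi-label binary cross-entropy classification loss with a CLIP-style symmetric contrastive loss over image and text embeddings projected into the same dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    # Hypothetical sketch of a combined loss: a multi-label classification
    # term plus a CLIP-style contrastive alignment term. The weighting
    # `alpha` and `temperature` values are illustrative assumptions.
    def __init__(self, alpha: float = 0.5, temperature: float = 0.07):
        super().__init__()
        self.alpha = alpha
        self.temperature = temperature

    def forward(self, logits, targets, img_emb, txt_emb):
        # Multi-label classification loss (BCE over per-class logits)
        cls_loss = F.binary_cross_entropy_with_logits(logits, targets)

        # Normalize both embeddings so similarities are cosine similarities;
        # this assumes both have already been projected to the same dimension
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)

        # CLIP-style symmetric contrastive loss: matching image/text pairs
        # lie on the diagonal of the batch similarity matrix
        sim = img_emb @ txt_emb.t() / self.temperature
        labels = torch.arange(sim.size(0), device=sim.device)
        clip_loss = (F.cross_entropy(sim, labels)
                     + F.cross_entropy(sim.t(), labels)) / 2

        return self.alpha * cls_loss + (1 - self.alpha) * clip_loss
```

Keeping the two terms in one scalar objective lets a single backward pass update both the classification head and the alignment between the two modalities.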