The performance of vision-language models (VLMs) such as CLIP on visual classification tasks has been enhanced by leveraging semantic knowledge from large language models (LLMs) such as GPT. Recent studies have shown that, in zero-shot classification, descriptors incorporating additional cues, high-level concepts, or even random characters often outperform those that use only the category name. In many classification tasks, top-1 accuracy may be relatively low while top-5 accuracy is significantly higher; this gap implies that most misclassifications occur among a few similar classes, highlighting the model's difficulty in distinguishing classes with subtle differences. To address this challenge, we introduce a novel concept of comparative descriptors. These descriptors emphasize the unique features of a target class against its most similar classes, enhancing differentiation. By generating these comparative descriptors and integrating them into the classification framework, we refine the semantic focus and improve classification accuracy. An additional filtering process ensures that these descriptors lie closer to the image embeddings in the CLIP space, further improving performance. Our approach demonstrates improved accuracy and robustness in visual classification tasks by addressing the specific challenge of subtle inter-class differences.
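As a concrete illustration of the idea, the sketch below implements zero-shot classification with comparative descriptors on top of the open-source CLIP package (github.com/openai/CLIP). It is a minimal sketch under stated assumptions, not the paper's exact procedure: the descriptor strings are hypothetical placeholders standing in for LLM-generated output, and keeping only the top-k descriptors closest to the image embedding is one plausible reading of the filtering step described above.

```python
# Minimal sketch: zero-shot classification with comparative descriptors,
# assuming the open-source CLIP package (github.com/openai/CLIP).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical comparative descriptors: each emphasizes how the target
# class differs from a visually similar class. In the paper these would
# be generated by prompting an LLM with the target class and its most
# similar classes.
descriptors = {
    "fox": [
        "a fox, which has a bushier tail than a dog",
        "a fox, which has a narrower snout than a wolf",
    ],
    "wolf": [
        "a wolf, which is larger and longer-legged than a fox",
        "a wolf, which has a broader muzzle than a dog",
    ],
}

@torch.no_grad()
def classify(image_tensor, keep_ratio=0.5):
    """Score each class by its comparative descriptors, filtering out
    descriptors whose embeddings are far from the image embedding
    (an assumed per-image top-k variant of the filtering step)."""
    img = model.encode_image(image_tensor.unsqueeze(0).to(device))
    img = img / img.norm(dim=-1, keepdim=True)

    scores = {}
    for cls, texts in descriptors.items():
        tok = clip.tokenize(texts).to(device)
        txt = model.encode_text(tok)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        sims = (txt @ img.T).squeeze(-1)           # cosine similarity per descriptor
        k = max(1, int(len(texts) * keep_ratio))   # keep descriptors closest to the image
        scores[cls] = sims.topk(k).values.mean().item()
    return max(scores, key=scores.get)
```

Given an image prepared with `preprocess(Image.open(path))`, `classify` returns the class whose surviving descriptors best match the image; averaging per-descriptor similarities after filtering is what lets class-discriminative cues, rather than the bare category name, drive the decision.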