Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, in which semantic similarities between vision and language embeddings can be measured directly. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-time augmentation of the original one. We propose an alternative evaluation scenario to decide whether the performance boost from LLM-generated descriptions is caused by such a noise-augmentation effect or by genuine description semantics. The proposed scenario avoids noisy test-time augmentation and ensures that any performance boost is caused by genuine, distinctive descriptions. Furthermore, we propose a training-free method for selecting discriminative descriptions whose benefit is independent of classname-ensembling effects. Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets. Additionally, we provide insights into the explainability of description-based image classification with VLMs.
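To make the description-based classification setup concrete, the following is a minimal sketch, not the paper's selection method: zero-shot CLIP classification where each class is scored by the mean cosine similarity between the image embedding and the embeddings of that class's descriptions. The classnames, descriptions, and image file are hypothetical placeholders, and the sketch assumes OpenAI's `clip` package; mean-similarity scoring is one common ensembling choice, and it is exactly this kind of averaging over multiple prompts that can act as a semantic-agnostic ensemble.

```python
# Minimal sketch (assumptions labeled below): zero-shot CLIP classification
# scoring each class by the mean similarity over its descriptions.
# Requires the OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical classes and LLM-generated descriptions (placeholders, not from the paper).
descriptions = {
    "hen": ["a hen, which has a small red comb",
            "a hen, which has brown feathers"],
    "goose": ["a goose, which has a long neck",
              "a goose, which has white plumage"],
}

# Hypothetical input image.
image = preprocess(Image.open("bird.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # Embed the image and L2-normalize so dot products are cosine similarities.
    img_feat = model.encode_image(image)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)

    scores = {}
    for cls, descs in descriptions.items():
        tokens = clip.tokenize(descs).to(device)
        txt_feat = model.encode_text(tokens)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        # Ensemble: average cosine similarity over the class's descriptions.
        scores[cls] = (img_feat @ txt_feat.T).mean().item()

print(max(scores, key=scores.get))  # predicted class label
```

A plain-classname baseline is recovered by replacing each description list with the single prompt "a photo of a {classname}"; comparing the two scores per class is what motivates disentangling ensembling effects from description semantics.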