Accurately describing images via text is a foundation of explainable AI. Vision-Language Models (VLMs) such as CLIP have recently addressed this by aligning images and texts in a shared embedding space, where semantic similarities between vision and language embeddings can be measured directly. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of the actual description semantics, as the performance gain may also stem from a semantics-agnostic ensembling effect. Considering this, we ask how to distinguish the genuine discriminative power of descriptions from performance boosts that merely rely on an ensembling effect. To study this, we propose an alternative evaluation scenario that exhibits a characteristic behavior if the descriptions used do have discriminative power. Furthermore, we propose a training-free method for selecting discriminative descriptions that works independently of classname-ensembling effects. The training-free method proceeds as follows: a test image has a local CLIP label neighborhood, i.e., its top-$k$ label predictions. Then, with respect to a small selection set, we extract the descriptions that best distinguish each class within this local neighborhood. Using the selected descriptions, we demonstrate improved classification accuracy across seven datasets and provide an in-depth analysis of, and insights into, the explainability of description-based image classification with VLMs.
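The selection procedure can be made concrete with a minimal sketch. This is an illustrative reconstruction from the abstract alone, not the authors' implementation: all names, shapes, and the margin-based scoring rule are assumptions, the random arrays stand in for precomputed, L2-normalized CLIP embeddings, and "distinguishing each class well in the local neighborhood" is instantiated here as a simple similarity margin over the small selection set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins (shapes and names are assumptions, not the paper's):
C, D, E = 10, 5, 512  # classes, candidate descriptions per class, embedding dim
def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class_name_embs = l2norm(rng.normal(size=(C, E)))  # CLIP text embs of classnames
desc_embs = l2norm(rng.normal(size=(C, D, E)))     # CLIP text embs of LLM descriptions
sel_imgs = l2norm(rng.normal(size=(40, E)))        # small selection set: image embs
sel_labels = rng.integers(0, C, size=40)           # their ground-truth labels
test_img = l2norm(rng.normal(size=E))              # one test-image embedding

# Step 1: the local CLIP label neighborhood of the test image,
# i.e., its top-k classname predictions by cosine similarity.
k = 3
neighborhood = np.argsort(-(class_name_embs @ test_img))[:k]

# Step 2: with respect to the selection set, score each candidate description
# of a neighborhood class by how much better it matches that class's images
# than the images of the other neighborhood classes (a similarity margin),
# and keep the top-m descriptions per class.
m = 2
selected = {}
for c in neighborhood:
    own = sel_imgs[sel_labels == c]
    other = sel_imgs[np.isin(sel_labels, neighborhood) & (sel_labels != c)]
    if len(own) == 0 or len(other) == 0:
        selected[c] = np.arange(m)  # fallback: keep the first m descriptions
        continue
    margin = (desc_embs[c] @ own.T).mean(1) - (desc_embs[c] @ other.T).mean(1)
    selected[c] = np.argsort(-margin)[:m]

# Step 3: classify within the neighborhood using only the selected, locally
# discriminative descriptions (mean description-image similarity per class).
scores = {c: float((desc_embs[c][selected[c]] @ test_img).mean()) for c in neighborhood}
prediction = max(scores, key=scores.get)
print(f"neighborhood={neighborhood}, prediction={prediction}")
```

Because the selection uses only similarity comparisons on a small labeled set, no gradients or fine-tuning are involved, which is what makes the method training-free in this sketch.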