Advances in multi-modal embeddings, and in particular CLIP, have recently driven several breakthroughs in Computer Vision (CV). CLIP has shown impressive performance on a variety of tasks; yet its inherently opaque architecture may hinder the adoption of models employing CLIP as a backbone, especially in fields where trust and model explainability are imperative, such as the medical domain. Current explanation methodologies for CV models rely on Saliency Maps computed through gradient analysis or input perturbation. However, these Saliency Maps can only be computed to explain classes relevant to the end task, which are often narrower in scope than the classes the backbone was trained on. In the context of models implementing CLIP as their vision backbone, a substantial portion of the information embedded within the learned representations is thus left unexplained. In this work, we propose Concept Visualization (ConVis), a novel saliency methodology that explains the CLIP embedding of an image by exploiting the multi-modal nature of the embeddings. ConVis makes use of lexical information from WordNet to compute task-agnostic Saliency Maps for any concept, not limited to concepts the end model was trained on. We validate our use of WordNet via an out-of-distribution detection experiment, and test ConVis on an object localization benchmark, showing that Concept Visualizations correctly identify and localize the image's semantic content. Additionally, we perform a user study demonstrating that our methodology can give users insight into the model's functioning.
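To make the idea concrete, the sketch below illustrates one simple way a concept-conditioned saliency map can be derived from CLIP's joint image-text space: a WordNet gloss enriches a text prompt for the concept, and the importance of each image region is estimated by the drop in image-text similarity when that region is occluded. This is a minimal, hedged illustration of the general principle, not the actual ConVis algorithm; the occlusion scheme, the prompt construction, and the helper names (`concept_prompt`, `occlusion_saliency`) are assumptions introduced here for exposition.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def concept_prompt(word: str) -> str:
    # Hypothetical prompt construction: enrich the concept with its first
    # WordNet gloss. The paper's actual use of WordNet may differ.
    synsets = wn.synsets(word)
    gloss = synsets[0].definition() if synsets else ""
    return f"a photo of a {word}, {gloss}" if gloss else f"a photo of a {word}"


@torch.no_grad()
def occlusion_saliency(image_path: str, word: str, patch: int = 32, stride: int = 16):
    """Concept saliency via occlusion: similarity drop when a region is masked."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)  # (1,3,224,224)
    text = clip.tokenize([concept_prompt(word)]).to(device)
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    def score(img: torch.Tensor) -> float:
        emb = model.encode_image(img)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        return (emb @ text_emb.T).item()

    base = score(image)
    _, _, h, w = image.shape
    saliency = torch.zeros(h, w)
    counts = torch.zeros(h, w)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            occluded = image.clone()
            occluded[:, :, y:y + patch, x:x + patch] = 0.0  # mask one region
            drop = base - score(occluded)  # larger drop = region more relevant to concept
            saliency[y:y + patch, x:x + patch] += drop
            counts[y:y + patch, x:x + patch] += 1
    return saliency / counts.clamp(min=1)
```

Because the text encoder accepts arbitrary concepts, the same procedure can produce a map for any WordNet lemma, independent of the classes any downstream model was trained on, which is the property the abstract refers to as task-agnostic.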