Leveraging the powerful representations of large vision-language models (VLMs) for downstream tasks has attracted increasing attention. Within this research field, soft prompt learning has become a representative approach for efficiently adapting VLMs such as CLIP to tasks like image classification. However, most existing prompt learning methods learn text tokens that are not interpretable, and therefore cannot satisfy the stringent interpretability requirements of Explainable Artificial Intelligence (XAI) in high-stakes scenarios such as healthcare. To address this issue, we propose a novel explainable prompt learning framework that leverages medical knowledge by aligning the semantics of images, learnable prompts, and clinical concept-driven prompts at multiple granularities. Moreover, our framework addresses the scarcity of concept annotations by eliciting knowledge from large language models, and it offers both visual and textual explanations for the prompts. Extensive experiments and explainability analyses on various datasets, with and without concept labels, demonstrate that our method simultaneously achieves superior diagnostic performance, flexibility, and interpretability, shedding light on the effectiveness of foundation models in facilitating XAI. The code will be made publicly available.