Large Vision-Language Models (VLMs), such as CLIP, have significantly advanced a range of computer vision tasks, including object recognition and object detection, and their open-vocabulary capability further enhances their value. However, their black-box nature and the lack of explainability in their predictions make them less trustworthy in critical domains. Recently, several works have attempted to make VLMs produce reasonable rationales for object recognition, but this often comes at the expense of classification accuracy. In this paper, we first propose a mathematical definition of explainability for the object recognition task based on the joint probability distribution of categories and rationales, and then leverage this definition to fine-tune CLIP in an explainable manner. Evaluations on multiple datasets show that our method achieves state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, demonstrating its adaptability. This advancement improves explainable object recognition and enhances trust across diverse applications. The code will be made available online upon publication.
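The joint-probability view mentioned above can be sketched as follows. This is an illustrative formulation only: the abstract states that explainability is defined via the joint distribution of categories and rationales, and the factorization and symbols below (image $x$, category $c$, rationale $r$) are assumptions, not the paper's exact notation.

```latex
% A prediction is counted as explainably correct only when both the
% category and its supporting rationale are right, so we score the pair
% jointly rather than the category alone:
\[
  p(c, r \mid x) \;=\; p(r \mid c, x)\, p(c \mid x),
  \qquad
  (\hat{c}, \hat{r}) \;=\; \argmax_{(c, r)} \, p(c, r \mid x).
\]
```

Under such a definition, a model that names the correct category for the wrong reason is penalized, which is one plausible way to reconcile rationale quality with classification accuracy.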