Concept Bottleneck Models (CBMs) map an input image to a high-level, human-understandable concept space and then make class predictions based on these concepts. Recent approaches automate the construction of CBMs by prompting Large Language Models (LLMs) to generate text concepts and then using Vision Language Models (VLMs) to obtain concept scores for training a CBM. However, it is desirable to build CBMs with concepts defined by human experts rather than LLM-generated concepts, to make them more trustworthy. In this work, we take a closer look at the faithfulness of VLM concept scores for such expert-defined concepts in domains like fine-grained bird species classification and animal classification. Our investigations reveal that frozen VLMs, like CLIP, struggle to correctly associate a concept with the corresponding visual input despite achieving high classification performance. To address this, we propose a novel Contrastive Semi-Supervised (CSS) learning method that uses a few labeled concept examples to improve concept alignment (i.e., to activate truthful visual concepts) in the CLIP model. Extensive experiments on three benchmark datasets show that our approach substantially increases both concept accuracy and classification accuracy while requiring only a fraction of the human-annotated concept labels. To further improve classification performance, we also introduce a new class-level intervention procedure for fine-grained classification problems that identifies the confounding classes and intervenes on their concept space to reduce errors.
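The generic pipeline the abstract describes (VLM concept scores feeding a classifier that sees only those scores) can be illustrated with a minimal sketch. It is not the paper's CSS method or its intervention procedure. The embeddings, concept names, weights, and the helper names `cosine`, `concept_scores`, and `predict_class` are all hypothetical; a real system would take image and text embeddings from a VLM such as CLIP and would learn the classifier weights.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def concept_scores(image_emb, concept_embs):
    """Bottleneck layer: one similarity score per text concept."""
    return [cosine(image_emb, c) for c in concept_embs]

def predict_class(scores, weights, biases):
    """Linear classifier over concept scores only; returns the argmax class index."""
    logits = [
        sum(w * s for w, s in zip(row, scores)) + b
        for row, b in zip(weights, biases)
    ]
    return max(range(len(logits)), key=logits.__getitem__)

# Toy values, made up for illustration (not real CLIP outputs).
image_emb = [0.9, 0.1, 0.0]
concept_embs = [
    [1.0, 0.0, 0.0],  # hypothetical concept "red wing"
    [0.0, 1.0, 0.0],  # hypothetical concept "long beak"
]
weights = [[2.0, -1.0], [-1.0, 2.0]]  # rows: classes, columns: concepts
biases = [0.0, 0.0]

scores = concept_scores(image_emb, concept_embs)
label = predict_class(scores, weights, biases)
```

Because the classifier reads only `scores`, any concept the VLM scores incorrectly propagates directly into the class prediction, which is why the faithfulness of the concept scores matters even when overall classification accuracy is high.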