In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) that addresses the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors, are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder's outputs and the text encoder's inputs. Specifically, for a given image, we leverage the codebook to identify the conceptual prompts most relevant to the image and associate them with the class embeddings to perform classification. Additionally, we incorporate a handcrafted concept cache as a regularizer to alleviate overfitting in low-shot scenarios. We observe that this conceptual codebook learning method achieves enhanced alignment between the visual and linguistic modalities. Extensive experimental results demonstrate that CoCoLe remarkably outperforms existing state-of-the-art methods across various evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization. Detailed ablation studies further confirm the efficacy of each component of CoCoLe.
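The codebook described above acts as a key-value store: an image feature queries the visual-concept keys, and the best-matching conceptual prompts are returned to condition the text encoder. The following is a minimal sketch of that retrieval step only; the function name, cosine-similarity scoring, and all array shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def retrieve_conceptual_prompts(image_feat, concept_keys, prompt_values, top_k=3):
    """Return the top-k conceptual prompts whose keys best match the image.

    image_feat:    (d,)   output of the image encoder (assumed shape)
    concept_keys:  (n, d) learned visual-concept keys
    prompt_values: (n, p) learned conceptual-prompt embeddings (values)
    """
    # Cosine similarity between the image feature and every concept key.
    img = image_feat / np.linalg.norm(image_feat)
    keys = concept_keys / np.linalg.norm(concept_keys, axis=1, keepdims=True)
    sims = keys @ img
    # Indices of the top-k most similar concepts, highest first.
    idx = np.argsort(-sims)[:top_k]
    return prompt_values[idx], sims[idx]

# Toy example with random embeddings (for illustration only).
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))    # 8 concept keys of dimension 4
values = rng.normal(size=(8, 6))  # matching conceptual-prompt embeddings
feat = rng.normal(size=4)         # one image encoder output
prompts, scores = retrieve_conceptual_prompts(feat, keys, values, top_k=3)
```

The retrieved `prompts` would then be fed, together with the class names, into the text encoder to produce the class embeddings used for classification.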