In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) that addresses the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors, are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this observation, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder's outputs and the text encoder's inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform classification. Additionally, we incorporate a handcrafted concept cache as a regularizer to alleviate overfitting in low-shot scenarios. We observe that this conceptual codebook learning method achieves enhanced alignment between the visual and linguistic modalities. Extensive experimental results demonstrate that our CoCoLe method remarkably outperforms existing state-of-the-art methods across various evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization tasks. Detailed ablation studies further confirm the efficacy of each component in CoCoLe.
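To make the key-value lookup concrete, below is a minimal sketch of a conceptual codebook as described above, assuming a frozen CLIP-like image encoder and cosine-similarity retrieval; the names (CodebookHead, num_concepts, prompt_dim, top_k) and the similarity-weighted pooling are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the codebook lookup: keys are visual-concept embeddings,
# values are learnable conceptual prompts retrieved per image.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodebookHead(nn.Module):
    def __init__(self, num_concepts: int, feat_dim: int, prompt_dim: int, top_k: int = 4):
        super().__init__()
        # Keys: visual-concept embeddings; values: conceptual prompts (both learnable).
        self.keys = nn.Parameter(torch.randn(num_concepts, feat_dim))
        self.values = nn.Parameter(torch.randn(num_concepts, prompt_dim))
        self.top_k = top_k

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, feat_dim) outputs of the frozen image encoder.
        sim = F.normalize(image_feats, dim=-1) @ F.normalize(self.keys, dim=-1).T  # (B, N)
        weights, idx = sim.topk(self.top_k, dim=-1)                                # (B, k)
        prompts = self.values[idx]                                                 # (B, k, prompt_dim)
        # Similarity-weighted pooling of the retrieved conceptual prompts.
        return (weights.softmax(dim=-1).unsqueeze(-1) * prompts).sum(dim=1)        # (B, prompt_dim)


if __name__ == "__main__":
    head = CodebookHead(num_concepts=100, feat_dim=512, prompt_dim=512)
    feats = torch.randn(8, 512)                            # stand-in for image-encoder outputs
    pooled = head(feats)                                   # prompts passed to the text-encoder side
    class_embeds = F.normalize(torch.randn(10, 512), dim=-1)
    logits = F.normalize(pooled, dim=-1) @ class_embeds.T  # score against class embeddings
    print(logits.shape)                                    # torch.Size([8, 10])
```

In this sketch, retrieval ties each image to the conceptual prompts most similar to its visual features, which is how the codebook bridges the image encoder's outputs and the text encoder's inputs; the handcrafted concept cache used for regularization is omitted here.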