With the success of pre-trained vision-language (VL) models such as CLIP on visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, prompt tuning, which draws inspiration from natural language processing (NLP), has made significant progress in the VL field. However, previous methods mainly focus on constructing prompt templates for text and visual inputs, neglecting the gap in class label representations between VL models and downstream tasks. To address this challenge, we introduce a label alignment method named \textbf{LAMM}, which dynamically adjusts the category embeddings of downstream datasets through end-to-end training. Moreover, to achieve a more appropriate label distribution, we propose a hierarchical loss that aligns the parameter space, feature space, and logits space. We conduct experiments on 11 downstream vision datasets and demonstrate that our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios, with an average accuracy improvement of 2.31\% over state-of-the-art methods at 16 shots. LAMM also outperforms other prompt tuning methods in continual learning. Importantly, our method is complementary to existing prompt tuning methods and can boost performance on top of them. Our code and dataset will be publicly available at https://github.com/gaojingsheng/LAMM.
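As a rough illustration of how a loss spanning the parameter, feature, and logits spaces could be assembled, the sketch below combines a classification term with three alignment terms. Everything specific here is an assumption made for illustration and is not taken from the abstract: the function `hierarchical_loss`, the linear map `text_proj` standing in for a frozen text encoder, the choice of squared distance, cosine distance, and KL divergence for the three terms, and the weights `w_param`, `w_feat`, `w_logit`. The released LAMM code may differ on all of these points.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(x):
    """Numerically stable row-wise softmax."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def hierarchical_loss(label_emb, init_label_emb, text_proj, image_feat, labels,
                      w_param=1.0, w_feat=1.0, w_logit=1.0,
                      scale=100.0, eps=1e-12):
    """Illustrative loss with parameter-, feature-, and logits-space terms.

    label_emb:      (C, D) trainable category embeddings
    init_label_emb: (C, D) their values before tuning
    text_proj:      (D, F) hypothetical stand-in for a frozen text encoder
    image_feat:     (N, F) image features from a frozen image encoder
    labels:         (N,)   integer class indices
    """
    # Text features from the tuned and the original category embeddings.
    text_feat = l2_normalize(label_emb @ text_proj)
    ref_feat = l2_normalize(init_label_emb @ text_proj)
    img = l2_normalize(image_feat)

    # CLIP-style class probabilities from scaled cosine similarities.
    probs = softmax(scale * img @ text_feat.T)
    ref_probs = softmax(scale * img @ ref_feat.T)

    # Classification term: cross-entropy on the tuned predictions.
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

    # Parameter space (assumed form): stay close to the initial embeddings.
    param = np.mean((label_emb - init_label_emb) ** 2)

    # Feature space (assumed form): cosine distance between text features.
    feature = np.mean(1.0 - np.sum(text_feat * ref_feat, axis=-1))

    # Logits space (assumed form): KL(reference || tuned) over classes.
    logits = np.mean(np.sum(
        ref_probs * (np.log(ref_probs + eps) - np.log(probs + eps)), axis=-1))

    total = ce + w_param * param + w_feat * feature + w_logit * logits
    return {"total": total, "ce": ce, "param": param,
            "feature": feature, "logits": logits}
```

In this sketch only `label_emb` would receive gradients in an actual training loop; the three alignment terms vanish when the embeddings are still at their initial values and grow as the embeddings drift away from them.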