Multi-modal (vision-language) models such as CLIP are replacing traditional supervised pre-trained models (e.g., those pre-trained on ImageNet) as the new generation of visual foundation models. These models learn robust, aligned semantic representations from billions of internet image-text pairs and can be applied to a wide range of downstream tasks in a zero-shot manner. However, in some fine-grained domains, such as medical imaging and remote sensing, their performance often falls short. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, which have gradually converged on three main technical approaches: 1) prompt-based methods, 2) adapter-based methods, and 3) external knowledge-based methods. Although this rapidly developing field has produced numerous results, no comprehensive survey yet organizes the research progress systematically. In this survey, we therefore introduce and analyze research advances in few-shot adaptation methods for multi-modal models, summarize commonly used datasets and experimental setups, and compare the results of different methods. In addition, because existing methods lack reliable theoretical support, we derive a few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions: 1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive knowledge utilization.