Few-shot Adaptation of Multi-modal Foundation Models: A Survey

Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models with robust and aligned semantic representations learned from billions of internet image-text pairs and can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, gradually deriving three main technical approaches: 1) prompt-based methods, 2) adapter-based methods, and 3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced numerous results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey, we introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models, summarizing commonly used datasets and experimental setups, and comparing the results of different methods. In addition, due to the lack of reliable theoretical support for existing methods, we derive the few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions from the following aspects: 1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive knowledge utilization.

翻译：多模态（视觉-语言）模型，如CLIP，正逐步取代传统监督预训练模型（例如基于ImageNet的预训练），成为新一代视觉基础模型。这些模型从数十亿互联网图文对中学习到鲁棒且对齐的语义表征，能够以零样本方式应用于各类下游任务。然而，在医学影像和遥感等细粒度领域，多模态基础模型的性能往往不尽如人意。为此，众多研究者开始探索这些模型的少样本适应方法，逐步形成了三种主要技术路线：1）基于提示的方法，2）基于适配器的方法，以及3）基于外部知识的方法。尽管如此，这一快速发展领域已产生大量研究成果，却缺乏系统性综述来梳理研究进展。因此，本综述对多模态模型少样本适应方法的研究进展进行了归纳与分析，总结了常用数据集与实验设置，并对比了不同方法的结果。此外，针对现有方法缺乏可靠理论支撑的问题，我们推导了多模态模型少样本适应的泛化误差上界。该定理表明，多模态基础模型的泛化误差受三个因素制约：领域差距、模型容量和样本数量。基于此，我们从以下三个方面提出了三种可能的解决方案：1）自适应领域泛化，2）自适应模型选择，3）自适应知识利用。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日