Various adaptation methods, such as LoRA, prompts, and adapters, have been proposed to enhance the performance of pre-trained vision-language models in specific domains. However, the robustness of these adaptation methods against distribution shifts has not been studied. In this study, we assess the robustness of 11 widely used adaptation methods across 4 vision-language datasets under multimodal corruptions. Concretely, we introduce 7 benchmark datasets, including 96 visual and 87 textual corruptions, to investigate the robustness of different adaptation methods, the impact of the number of available adaptation examples, and the influence of trainable parameter size during adaptation. Our analysis reveals that: 1) Adaptation methods are more sensitive to text corruptions than to visual corruptions. 2) Full fine-tuning does not consistently provide the highest robustness; instead, adapters can achieve better robustness with comparable clean performance. 3) Contrary to expectations, increasing the amount of adaptation data and the number of trainable parameters does not guarantee enhanced robustness; instead, it results in even lower robustness. We hope this study can benefit future research on the development of robust multimodal adaptation methods. The benchmark, code, and dataset used in this study can be accessed at https://adarobustness.github.io.
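To make the evaluation setup described above more concrete, the following is a minimal sketch of how a text corruption and a robustness score could be computed. It is not the benchmark's implementation: the function names `swap_adjacent_chars` and `relative_robustness`, the character-swap corruption, the ratio-based score, and the example numbers are all assumptions made for illustration. The metric and corruptions actually used should be taken from the released code at the URL above.

```python
import random


def swap_adjacent_chars(text: str, rate: float, seed: int = 0) -> str:
    """Toy text corruption (hypothetical): swap neighbouring characters
    with probability `rate`, using a seeded RNG for reproducibility."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def relative_robustness(clean_score: float, corrupted_score: float) -> float:
    """Assumed summary score: fraction of clean performance retained
    under corruption (1.0 means no degradation)."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return corrupted_score / clean_score


if __name__ == "__main__":
    # Corrupt a caption the way a text-side perturbation might.
    print(swap_adjacent_chars("a dog runs on the beach", rate=0.2, seed=0))

    # Made-up scores for one hypothetical method; NOT results from the study.
    clean, image_corrupted, text_corrupted = 80.0, 72.0, 60.0
    print("image:", relative_robustness(clean, image_corrupted))
    print("text: ", relative_robustness(clean, text_corrupted))
```

Under this assumed score, a method that retains less of its clean performance on text-corrupted inputs than on image-corrupted inputs would be called more sensitive to text corruptions, which is the kind of comparison the first finding refers to.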