Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and a language model through contrastive training. Can such models be created with a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that create vision-language models by updating a small number of parameters in a language model, but these require already-aligned visual representations and are non-contrastive, and hence are unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision model and language model. We find that a minimal set of parameter updates ($<$7%) can achieve the same performance as full-model training, and that updating specific components ($<$1% of parameters) can reach 75% of full-model performance. We describe a series of experiments showing that existing knowledge is conserved more strongly under parameter-efficient training and that parameter-efficient training scales with model and dataset size. Where paired image-text data is scarce but strong multilingual language models exist (e.g. for low-resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows larger models to be trained on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at https://github.com/codezakh/LilT.
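The two ingredients named above, a contrastive objective and a small trainable subset of parameters, can be illustrated with a minimal sketch. It is not the released LilT code: the symmetric cross-entropy loss follows the general CLIP-style formulation, the temperature of 0.07 is an assumed default, and the parameter names, sizes, and the choice to unfreeze only normalization layers are hypothetical, since the abstract does not say which components are updated.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss over a batch of pairs.

    Pair i of image_embs and text_embs is treated as the only positive;
    every other combination in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def trainable_fraction(param_sizes, is_trainable):
    """Fraction of parameters left unfrozen under a given selection rule."""
    total = sum(param_sizes.values())
    trainable = sum(n for name, n in param_sizes.items() if is_trainable(name))
    return trainable / total


if __name__ == "__main__":
    # Hypothetical parameter inventory; names and sizes are made up.
    sizes = {
        "vision.block0.attn.weight": 590_000,
        "vision.block0.norm.weight": 768,
        "text.block0.attn.weight": 590_000,
        "text.block0.norm.weight": 768,
    }
    # Assumed rule: only normalization layers receive gradient updates.
    frac = trainable_fraction(sizes, lambda name: ".norm." in name)
    print(f"trainable fraction: {frac:.4%}")

    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned loss {aligned:.6f} < swapped loss {swapped:.6f}")
```

In a real training loop the same selection rule would decide which parameters receive gradients, while the loss is computed from the outputs of the otherwise frozen vision and language encoders.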