COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?

Compositional reasoning is a hallmark of human visual intelligence; yet despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. Using Cola as a testbed, we explore modeling designs to adapt pre-trained vision-language models to reason compositionally about multiple attributes attached to multiple objects. We explore 6 finetuning strategies on 2 seminal vision-language models, using 3 finetuning datasets and 2 test benchmarks (Cola and CREPE). Surprisingly, our optimal finetuning strategy improves a 151M parameter CLIP, which disjointly encodes image and language during pretraining, to perform as well as a 241M parameter FLAVA, which uses a multi-modal transformer encoder during pretraining to attend over both vision and language modalities. This optimal finetuning strategy is a lightweight multi-modal adapter that jointly attends over both image and language features generated by the pretrained model. We show this works better than common strategies such as prompt/fine-tuning, or tuning a comparable number of unimodal layers.
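To make the idea of a multi-modal adapter concrete, the sketch below scores an image-caption pair by letting a single attention layer attend jointly over image and text token features, then ranks candidate images for a caption. It is a minimal illustration under stated assumptions, not the authors' architecture: the single-head attention, the mean pooling, the linear scoring vector `w_out`, and all function names (`cross_attend`, `adapter_score`, `retrieve`, `init_params`) are hypothetical, and the random vectors stand in for features that a frozen pretrained encoder such as CLIP would produce.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    return [dot(row, v) for row in W]


def cross_attend(queries, keys_values, Wq, Wk, Wv):
    """Single-head attention: each query token attends over all key/value tokens."""
    d = len(Wq)
    K = [matvec(Wk, t) for t in keys_values]
    V = [matvec(Wv, t) for t in keys_values]
    out = []
    for q_tok in queries:
        q = matvec(Wq, q_tok)
        weights = softmax([dot(q, k) / math.sqrt(d) for k in K])
        out.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(d)])
    return out


def adapter_score(image_feats, text_feats, params):
    """Score one image-caption pair (assumed scoring head, for illustration only)."""
    # Joint sequence: every token can attend to tokens of both modalities.
    tokens = image_feats + text_feats
    fused = cross_attend(tokens, tokens, params["Wq"], params["Wk"], params["Wv"])
    # Mean-pool the fused tokens, then project to a scalar score.
    pooled = [sum(col) / len(fused) for col in zip(*fused)]
    return dot(params["w_out"], pooled)


def init_params(dim, seed=0):
    """Randomly initialised adapter weights; only these would be trained."""
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

    return {"Wq": mat(), "Wk": mat(), "Wv": mat(),
            "w_out": [rng.gauss(0, 0.1) for _ in range(dim)]}


def retrieve(caption_feats, candidate_images, params):
    """Text-to-image retrieval: return the index of the best-scoring image and all scores."""
    scores = [adapter_score(img, caption_feats, params) for img in candidate_images]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    rng = random.Random(42)
    dim = 8
    # Placeholder features: 2 candidate images with 4 patch tokens each, 1 caption with 3 tokens.
    images = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)] for _ in range(2)]
    caption = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
    best, scores = retrieve(caption, images, init_params(dim))
    print("best image index:", best)
```

In this sketch the pretrained encoders are treated as fixed feature extractors and only the small set of adapter weights would be updated during finetuning, which is what makes such an adapter lightweight relative to tuning the full model.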

