Compositional reasoning is a hallmark of human visual intelligence. Yet despite their size, large vision-language models struggle to represent even simple compositions that combine objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. Using Cola as a testbed, we explore modeling designs that adapt pre-trained vision-language models to reason compositionally about multiple attributes attached to multiple objects. We explore 6 finetuning strategies on 2 seminal vision-language models, using 3 finetuning datasets and 2 test benchmarks (Cola and CREPE). Surprisingly, our optimal finetuning strategy improves a 151M-parameter CLIP, which encodes image and language separately during pretraining, to perform as well as a 241M-parameter FLAVA, which uses a multi-modal transformer encoder during pretraining to attend over both vision and language modalities. This optimal finetuning strategy is a lightweight multi-modal adapter that jointly attends over both the image and language features generated by the pretrained model. We show that this works better than common strategies such as prompt tuning, fine-tuning, or tuning a comparable number of unimodal layers.
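To make the idea of "jointly attending over both image and language features" concrete, the following is a minimal sketch rather than the adapter itself. It assumes that the pretrained encoders have already produced image patch features and text token features of the same dimension, and it applies a single self-attention step over their concatenation. The function names, the identity query/key/value projections, the single head, and the toy inputs are all assumptions made for illustration; the abstract does not specify the adapter's layers, projections, or training objective.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(image_feats, text_feats):
    """Single-head self-attention over concatenated image and text tokens.

    Hypothetical simplification: queries, keys and values are the raw
    features (identity projections), whereas a trained adapter would
    learn these projections.
    """
    tokens = image_feats + text_feats  # one joint sequence of vectors
    dim = len(tokens[0])
    fused = []
    for query in tokens:
        # Scaled dot-product score of this token against every token,
        # regardless of which modality it came from.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Weighted average of all token vectors (image and text alike).
        fused.append([
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return fused


# Toy inputs: two image patch features and one text token feature.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 1.0]]
fused = joint_self_attention(image_feats, text_feats)
```

In this sketch every output vector mixes information from both modalities, so changing the text changes the representation of each image token. An encoder that processes image and language separately has no such interaction between the two.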