Contrastively trained vision-language models (VLMs) such as CLIP have made remarkable progress in learning joint image-text representations, but they still struggle with compositional understanding. They often exhibit "bag-of-words" behavior, failing to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from optimizing global, single-vector representations, but also from insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full context from the other modality, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize the masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. The improved compositionality also benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.