Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of human language, often modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built with a dependency parser, and we train a decoder conditioned on the VLM visual encoder. Unlike standard autoregressive or parallel prediction, our decoder's generative process is partially ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence, discarding spurious correlations. Through extensive experiments on five compositional benchmarks, we show that our method outperforms all state-of-the-art compositional approaches by a large margin, and that it also improves over methods trained on much larger datasets.
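To illustrate the partial-order idea, the sketch below derives generation levels from a dependency parse: tokens at the same tree depth have no ancestor/descendant relation along the parse, so a CGM-ordered decoder could emit them in parallel once the shallower levels are available. The example caption, its head indices, and the depth-based grouping are all illustrative assumptions, not the paper's exact construction (which also includes visual tokens).

```python
from collections import defaultdict

# Hypothetical dependency parse of the caption "a black cat chases a small dog".
# heads[i] is the index of token i's syntactic head; the root points to itself.
tokens = ["a", "black", "cat", "chases", "a", "small", "dog"]
heads  = [ 2,    2,      3,     3,        6,   6,       3  ]

def generation_levels(heads):
    """Group token indices by depth in the dependency tree.

    Tokens at the same depth are conditionally independent given their
    ancestors, so (in the spirit of a CGM-ordered decoder) each level can
    be generated in parallel once all shallower levels are available.
    """
    def depth(i):
        d = 0
        while heads[i] != i:  # walk up to the root
            i = heads[i]
            d += 1
        return d

    levels = defaultdict(list)
    for i in range(len(heads)):
        levels[depth(i)].append(i)
    return [levels[d] for d in sorted(levels)]

# Root first, then its direct dependents, then their modifiers.
for level in generation_levels(heads):
    print([tokens[i] for i in level])
```

In this toy parse the decoder would generate "chases" first, then "cat" and "dog" together, and finally the determiners and adjectives, rather than following strict left-to-right order.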