Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework that enhances LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with their corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation pipeline that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks. The code is available at: https://github.com/lqh52/PromViL.