Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework for enhancing LVLMs' ability to perform grounded compositional visual reasoning. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with their corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation pipeline that produces a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that PromViL significantly outperforms baselines on various visual grounding and compositional question answering tasks.