Vision-Language Models (VLMs) have achieved strong performance on a variety of tasks (e.g., image-text retrieval, visual question answering). However, most VLMs align modalities using coarse-grained image-caption pairs, relying on sheer data volume to resolve ambiguities and ground linguistic concepts in images; the richer semantic and syntactic structure within the text is largely overlooked. To address this, we propose HIerarchically STructured Learning (HIST), which enhances VLM training without any additional supervision by hierarchically decomposing captions into their constituent Subjects, Noun Phrases, and Composite Phrases. Entailment relations between these constituents allow us to formulate additional regularization constraints on the VLM attention maps. Specifically, we introduce two novel loss functions: (1) a Subject Loss, which aligns image content with the subject of the corresponding phrase, acting as an entailment of the standard contrastive/matching losses at the Phrase level; and (2) an Addition Loss, which balances attention across multiple objects. HIST is general and can be applied to any VLM for which attention between vision and language can be computed; we illustrate its efficacy on BLIP and ALBEF. HIST outperforms baseline VLMs, achieving improvements of up to +9.8% on visual grounding, +6.3% on multi-object referring segmentation, +1.1% on image-text retrieval, and +0.2% on visual question answering, underscoring the value of structured learning in VLMs.
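To make the two regularizers concrete, below is a minimal, hypothetical sketch of how a Subject Loss and an Addition Loss over cross-attention maps might look. The abstract does not specify exact formulations, so the tensor shapes, the KL/MSE distance choices, and the function names here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of attention-map regularizers in the spirit of HIST.
# Assumes each text constituent yields a cross-attention map over image
# patches, non-negative and normalized over the patch dimension.
import torch
import torch.nn.functional as F


def subject_loss(subject_attn: torch.Tensor, phrase_attn: torch.Tensor) -> torch.Tensor:
    """Align the subject token's attention with its parent phrase's attention.

    subject_attn, phrase_attn: (batch, num_patches) attention distributions.
    KL divergence between patch distributions is one plausible choice.
    """
    return F.kl_div(
        subject_attn.clamp_min(1e-8).log(),  # input must be log-probabilities
        phrase_attn,
        reduction="batchmean",
    )


def addition_loss(np_attns: torch.Tensor, caption_attn: torch.Tensor) -> torch.Tensor:
    """Balance attention across multiple noun phrases.

    np_attns: (batch, num_phrases, num_patches) per-phrase attention maps.
    caption_attn: (batch, num_patches) caption-level attention map.
    One plausible reading: the combined noun-phrase attention should cover
    the same image regions as the full caption's attention.
    """
    combined = np_attns.sum(dim=1)
    combined = combined / combined.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return F.mse_loss(combined, caption_attn)
```

In a training loop, such terms would presumably be added, with weighting coefficients, to the VLM's standard contrastive/matching objectives; the hierarchical decomposition of captions into subjects and noun phrases would be produced by an off-the-shelf parser, consistent with the claim of requiring no additional supervision.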