State-of-the-art vision-language models (VLMs) still show limited performance in extracting structural knowledge, such as relations between objects. In this work, we present ViStruct, a training framework that teaches VLMs to extract visual structural knowledge effectively. It incorporates two novel designs. First, we leverage the inherent structure of programming languages to represent visual structural information. This enables an explicit and consistent representation of visual structural information at multiple granularities, such as concepts, relations, and events, in a well-organized structured format. Second, we introduce curriculum-based learning so that VLMs progressively comprehend visual structures, from fundamental visual concepts to intricate event structures. Our intuition is that lower-level knowledge may contribute to the understanding of complex visual structures. Furthermore, we compile and release a collection of datasets tailored for visual structural knowledge extraction. We adopt a weakly supervised approach that directly generates visual event structures from captions for ViStruct training, capitalizing on the abundant image-caption pairs available on the web. In experiments, we evaluate ViStruct on visual structure prediction tasks and demonstrate its effectiveness in improving the understanding of visual structures. The code is publicly available at \url{https://github.com/Yangyi-Chen/vi-struct}.
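To make the two designs concrete, the following is a minimal illustrative sketch of how visual structures at three granularities might be written as programming-language objects, and how training samples might be ordered from simple to complex. It is not the schema or training code used by ViStruct: the class names (`Concept`, `Relation`, `Event`), the helper functions `to_code` and `curriculum_order`, and the three-stage ordering are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Concept:
    """A visual concept, e.g. an object category (hypothetical schema)."""
    name: str


@dataclass
class Relation:
    """A pairwise relation between two concepts (hypothetical schema)."""
    subject: Concept
    predicate: str
    obj: Concept


@dataclass
class Event:
    """An event with a trigger and role-labelled arguments (hypothetical schema)."""
    trigger: str
    arguments: Dict[str, Concept]


# Assumed curriculum order: simpler structures come first.
CURRICULUM: List[str] = ["concept", "relation", "event"]


def to_code(event: Event) -> str:
    """Render an event as a call-like string, one possible code-style target."""
    args = ", ".join(
        f"{role}={concept.name!r}" for role, concept in event.arguments.items()
    )
    return f"{event.trigger}({args})"


def curriculum_order(samples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (level, sample_id) pairs by curriculum stage.

    sorted() is stable, so samples within a stage keep their original order.
    """
    return sorted(samples, key=lambda s: CURRICULUM.index(s[0]))


if __name__ == "__main__":
    riding = Event("ride", {"agent": Concept("person"), "vehicle": Concept("horse")})
    print(to_code(riding))
    print(curriculum_order([("event", "e1"), ("concept", "c1"), ("relation", "r1")]))
```

Under these assumptions, an event such as a person riding a horse becomes a single call-like expression whose argument names carry the semantic roles, which is the kind of explicit and consistent structure the abstract attributes to a code-based representation.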