Humans describe complex scenes compositionally, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets, which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labelled graph structure with nodes of various types. The nodes in GBC are created in two stages: in the first, object detection and dense captioning tools are applied recursively to uncover and describe entity nodes; in the second, these entity nodes are linked together through new types of nodes that highlight compositions and relations among entities. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility of natural language while also encoding hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and open-vocabulary detection models, by building a new dataset, GBC10M, which gathers GBC annotations for about 10M images of the CC12M dataset. We use GBC10M to showcase the wealth of node captions uncovered by GBC, as measured with CLIP training. We show that using GBC node annotations -- notably those stored in composition and relation nodes -- results in a significant performance boost on downstream models when compared to other dataset formats. To further explore the opportunities provided by GBC, we also propose a new attention mechanism that can leverage the entire GBC graph, with encouraging experimental results showing the extra benefits of incorporating the graph structure. Our datasets are released at \url{https://huggingface.co/graph-based-captions}.
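The annotation structure described above can be illustrated with a small sketch: typed nodes that each hold a plain-text caption, with edges linking an image node to entity, composition, and relation nodes. The class and field names below (`GBCNode`, `node_type`, `caption`, `children`) are illustrative assumptions for exposition, not the released GBC10M schema.

```python
# Minimal, hypothetical sketch of a GBC-style annotation.
# Field names and node types are assumptions, not the released schema;
# the real GBC is a general labelled graph, simplified here to a tree.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GBCNode:
    node_id: str
    node_type: str   # e.g. "image", "entity", "composition", "relation"
    caption: str     # every node holds a plain-text description
    children: List["GBCNode"] = field(default_factory=list)

def all_captions(node: GBCNode) -> List[str]:
    """Collect the captions of the whole annotation by traversal."""
    out = [node.caption]
    for child in node.children:
        out.extend(all_captions(child))
    return out

# Toy example: an image node whose relation node links two entities.
dog = GBCNode("n1", "entity", "a brown dog")
ball = GBCNode("n2", "entity", "a red ball")
rel = GBCNode("n3", "relation", "the dog is chasing the ball", [dog, ball])
root = GBCNode("n0", "image", "a dog chasing a ball in a park", [rel])

print(all_captions(root))
```

Flattening the graph this way recovers the pool of per-node captions that, per the abstract, drives the CLIP-training gains; the graph edges additionally retain the hierarchy that plain-text captions discard.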