Humans describe complex scenes compositionally, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets, which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labelled graph structure with nodes of various types. The nodes in GBC are created in two stages: in the first, object detection and dense captioning tools are applied recursively to uncover and describe entity nodes; in the second, these entity nodes are linked together through new types of nodes that highlight compositions and relations among entities. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility of natural language while also encoding hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and open-vocabulary detection models, by building a new dataset, GBC10M, which gathers GBC annotations for about 10M images of the CC12M dataset. We use GBC10M to showcase the wealth of node captions uncovered by GBC, as measured with CLIP training. We show that using GBC node annotations -- notably those stored in composition and relation nodes -- results in a significant performance boost on downstream models when compared to other dataset formats. To further explore the opportunities provided by GBC, we also propose a new attention mechanism that can leverage the entire GBC graph, with encouraging experimental results showing the extra benefits of incorporating the graph structure. Our datasets are released at \url{https://huggingface.co/graph-based-captions}.
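The annotation structure described above can be illustrated with a small sketch: typed nodes that each hold a plain-text caption, with edges linking an image node to entity, composition, and relation nodes. The class and field names below (`GBCNode`, `node_type`, `caption`, `children`) are illustrative assumptions for exposition, not the released GBC10M schema.

```python
# Minimal, hypothetical sketch of a GBC-style annotation.
# Field names and node types are assumptions, not the released schema;
# the real GBC is a general labelled graph, simplified here to a tree.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GBCNode:
    node_id: str
    node_type: str   # e.g. "image", "entity", "composition", "relation"
    caption: str     # every node holds a plain-text description
    children: List["GBCNode"] = field(default_factory=list)

def all_captions(node: GBCNode) -> List[str]:
    """Collect the captions of the whole annotation by traversal."""
    out = [node.caption]
    for child in node.children:
        out.extend(all_captions(child))
    return out

# Toy example: an image node whose relation node links two entities.
dog = GBCNode("n1", "entity", "a brown dog")
ball = GBCNode("n2", "entity", "a red ball")
rel = GBCNode("n3", "relation", "the dog is chasing the ball", [dog, ball])
root = GBCNode("n0", "image", "a dog chasing a ball in a park", [rel])

print(all_captions(root))
```

Flattening the graph this way recovers the pool of per-node captions that, per the abstract, drives the CLIP-training gains; the graph edges additionally retain the hierarchy that plain-text captions discard.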