Learning scene graphs from natural language descriptions has proven to be a cheap and promising scheme for Scene Graph Generation (SGG). However, such unstructured caption data and its processing hinder the learning of an accurate and complete scene graph. This dilemma can be summarized in three points. First, traditional language parsers often fail to extract meaningful relationship triplets from caption data. Second, grounding unlocalized objects in parsed triplets introduces ambiguity in visual-language alignment. Last, caption data are typically sparse and biased toward partial observations of image content. These three issues make it hard for the model to generate comprehensive and accurate scene graphs. To fill this gap, we propose a simple yet effective framework, GPT4SGG, to synthesize scene graphs from holistic and region-specific narratives. The framework discards the traditional language parser and localizes objects before obtaining relationship triplets. To obtain relationship triplets, holistic and dense region-specific narratives are generated from the image. With such a textual representation of image data and a task-specific prompt, an LLM, particularly GPT-4, directly synthesizes a scene graph as "pseudo labels". Experimental results show that GPT4SGG significantly improves the performance of SGG models trained on image-caption data. We believe this pioneering work can motivate further research into mining the visual reasoning capabilities of LLMs.
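The pipeline outlined in the abstract (localize objects first, describe the image through holistic and region-specific narratives, then prompt an LLM for a scene graph) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `LocalizedObject`, `build_prompt`, `parse_scene_graph`, the `category.index` object ids, the prompt wording, and the JSON reply format are hypothetical and are not taken from GPT4SGG. The LLM call itself is replaced by a hard-coded mock reply.

```python
import json
from dataclasses import dataclass


@dataclass
class LocalizedObject:
    obj_id: str   # hypothetical id scheme, e.g. "person.1"
    box: tuple    # (x1, y1, x2, y2) in pixels


def build_prompt(objects, holistic, region_narratives):
    """Assemble a textual representation of the image for an LLM.

    The wording is illustrative only, not the paper's task-specific prompt.
    """
    lines = ["Objects (id: box):"]
    for o in objects:
        lines.append(f"- {o.obj_id}: {list(o.box)}")
    lines.append("Holistic narrative:")
    lines.append(holistic)
    lines.append("Region-specific narratives:")
    for text in region_narratives:
        lines.append(f"- {text}")
    lines.append(
        "Return a JSON list of [subject_id, predicate, object_id] triplets "
        "using only the object ids listed above."
    )
    return "\n".join(lines)


def parse_scene_graph(response_text, objects):
    """Keep only triplets whose subject and object are localized objects."""
    known = {o.obj_id for o in objects}
    triplets = []
    for item in json.loads(response_text):
        if len(item) != 3:
            continue  # skip malformed entries
        subj, pred, obj = item
        if subj in known and obj in known and subj != obj:
            triplets.append((subj, pred, obj))
    return triplets


# Objects are localized before any relationship is extracted.
objects = [
    LocalizedObject("person.1", (10, 20, 110, 220)),
    LocalizedObject("horse.2", (60, 90, 300, 260)),
]
prompt = build_prompt(
    objects,
    holistic="A person rides a horse in a field.",
    region_narratives=["The person is sitting on the horse."],
)

# Stand-in for the LLM reply (assumed JSON format); a real reply would
# come from the model given `prompt`.
mock_response = (
    '[["person.1", "riding", "horse.2"], '
    '["person.1", "holding", "rope.3"]]'
)
scene_graph = parse_scene_graph(mock_response, objects)
print(scene_graph)  # [('person.1', 'riding', 'horse.2')]
```

Because every triplet must refer to an already localized object, the second mock triplet (which mentions the unknown `rope.3`) is dropped. This mirrors the motivation for localizing objects before obtaining relationships: the resulting pseudo labels need no separate grounding step.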