The intersection of vision and language has drawn major interest, driven by a growing focus on integrating recognition with reasoning. Scene graphs (SGs) have emerged as a useful tool for multimodal image analysis, performing well in tasks such as Visual Question Answering (VQA). In this work, we show that, although scene graphs are effective for VQA, current methods that rely on idealized annotated scene graphs struggle to generalize to predicted scene graphs extracted from images. To address this issue, we introduce the SelfGraphVQA framework. Our approach extracts a scene graph from an input image with a pre-trained scene graph generator and applies semantics-preserving augmentation with self-supervised techniques. This improves the use of graph representations in VQA while avoiding the need for costly and potentially biased annotated data. By creating alternative views of the extracted graphs through image augmentations, we learn joint embeddings by optimizing the informational content of their representations with an un-normalized contrastive approach. Because we work with SGs, we experiment with three distinct maximization strategies: node-wise, graph-wise, and permutation-equivariant regularization. We empirically demonstrate the effectiveness of extracted scene graphs for VQA and show that these strategies improve overall performance by emphasizing the importance of visual information. This offers a more practical solution for VQA tasks that rely on SGs for complex reasoning questions.
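As a rough illustration of the difference between the node-wise and graph-wise strategies, the following minimal sketch compares node embeddings from two augmented views of the same scene graph. It rests on several assumptions that the abstract does not confirm: that "un-normalized contrastive" means maximizing agreement between views without negative samples, that agreement is measured by cosine similarity, that the graph-level embedding is a mean over nodes, and that nodes are aligned index by index across views. The function names (`node_wise_loss`, `graph_wise_loss`, `mean_pool`) are hypothetical, and the permutation-equivariant regularization is not sketched because the abstract does not define it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def mean_pool(nodes):
    """Average node embeddings into one graph-level vector (assumed readout)."""
    n = len(nodes)
    return [sum(column) / n for column in zip(*nodes)]


def node_wise_loss(view_a, view_b):
    """Negative mean cosine similarity between index-aligned nodes of two views."""
    sims = [cosine(a, b) for a, b in zip(view_a, view_b)]
    return -sum(sims) / len(sims)


def graph_wise_loss(view_a, view_b):
    """Negative cosine similarity between the pooled graph embeddings of two views."""
    return -cosine(mean_pool(view_a), mean_pool(view_b))


# Two toy "views": the second holds the same node embeddings in swapped order.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.0, 1.0], [1.0, 0.0]]

print("node-wise loss:", node_wise_loss(view_a, view_b))
print("graph-wise loss:", graph_wise_loss(view_a, view_b))
```

In this toy setup the graph-wise loss ignores node order because mean pooling is permutation-invariant, whereas the node-wise loss depends on how nodes are matched across views. That sensitivity to node ordering is one plausible reason to treat permutation behavior separately when working with graphs.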