Most TextVQA approaches integrate objects, scene texts and question words with a simple transformer encoder, which fails to capture the semantic relations between different modalities. We propose a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, the Optical Character Recognition (OCR) tokens and the question words. It does so through a TextVQA-based scene graph that discovers the underlying semantics of an image. We design a guided-attention module that captures the intra-modal interplay within the language and vision modalities and uses it as guidance for inter-modal interactions. To teach the relations between the two modalities explicitly, we propose and integrate two attention modules: a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conduct extensive experiments on two benchmark datasets, Text-VQA and ST-VQA. The results show that SceneGATE outperforms existing methods, an improvement we attribute to the scene graph and its attention modules.
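To make the idea of relation-aware attention concrete, the following is a minimal sketch of one common way to inject pairwise relations into attention: a learned scalar bias per relation type is added to the scaled dot-product logits before the softmax. This is an illustration of the general technique only, not SceneGATE's actual implementation. The function name `relation_aware_attention`, the arrays `relation_ids` and `relation_bias`, and the toy dimensions are all assumptions introduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relation_aware_attention(nodes, relation_ids, relation_bias, w_q, w_k, w_v):
    """Single-head self-attention with an additive per-relation bias (hypothetical sketch).

    nodes:         (n, d) features for objects / OCR tokens / question words
    relation_ids:  (n, n) integer relation type between node i and node j
    relation_bias: (num_relations,) scalar bias per relation type
    w_q, w_k, w_v: (d, d) projection matrices
    """
    q = nodes @ w_q
    k = nodes @ w_k
    v = nodes @ w_v
    d_k = q.shape[-1]
    # Standard scaled dot-product logits.
    logits = (q @ k.T) / np.sqrt(d_k)
    # Look up the bias for each (i, j) pair and add it to the logits.
    logits = logits + relation_bias[relation_ids]
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


# Toy example: 4 nodes, 8-dimensional features, 3 assumed relation types.
rng = np.random.default_rng(0)
n, d = 4, 8
nodes = rng.normal(size=(n, d))
relation_ids = rng.integers(0, 3, size=(n, n))
relation_bias = np.array([0.0, 1.0, -1.0])
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = relation_aware_attention(
    nodes, relation_ids, relation_bias, w_q, w_k, w_v
)
```

In this sketch a semantic relation (for example, an edge label from a scene graph) and a positional relation (for example, a discretized spatial relation between bounding boxes) would each be encoded as their own `relation_ids` matrix and used in separate attention modules. That mapping is an assumption made for illustration.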