Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA

Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. Specifically, the large number of detected objects and optical character recognition (OCR) tokens yields a dense set of visual relationships, and existing works take all of them into account for answer prediction. However, we make three observations: (1) a single subject in an image is easily detected as multiple objects with distinct bounding boxes (which we call repetitive objects), and the associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in an image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may indicate important visual cues for predicting answers. Rather than using all relationships for answer prediction, we aim to identify the most important connections or eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and distance-IoU (DIoU). We consider three types of visual relationships for graph learning: object-object, OCR token-OCR token, and object-OCR token. SSGN is a progressive graph learning architecture that verifies the pivotal relations first in the correlated object-token sparse graph, and then in the respective object-based and token-based sparse graphs. Experimental results on the TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance, and visualization results further demonstrate the interpretability of our method.

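To make the idea of spatially aware relation pruning concrete, the following is a minimal sketch that scores box pairs with DIoU (IoU minus the squared center distance divided by the squared diagonal of the smallest enclosing box) and keeps only edges whose score falls in a middle band. The abstract does not specify SSGN's actual pruning rule, so the function names `diou` and `prune_edges`, the thresholds `low` and `high`, and the band rule itself are illustrative assumptions rather than the authors' method. The intuition is that a very high DIoU suggests repetitive objects (observation 1), while a very low DIoU suggests spatially distant regions (observation 2).

```python
from itertools import combinations


def diou(a, b):
    """Distance-IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection area (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2 if c2 > 0 else iou


def prune_edges(boxes, low=-0.5, high=0.9):
    """Keep edge (i, j) only if low < DIoU < high.

    Hypothetical rule: `low` and `high` are illustrative thresholds,
    not values taken from the paper.
    """
    kept = []
    for i, j in combinations(range(len(boxes)), 2):
        if low < diou(boxes[i], boxes[j]) < high:
            kept.append((i, j))
    return kept


boxes = [
    (0, 0, 10, 10),        # 0: an object
    (0.2, 0, 10.2, 10),    # 1: near-duplicate detection of box 0
    (12, 0, 22, 10),       # 2: a nearby region
    (90, 90, 100, 100),    # 3: a distant region
]
print(prune_edges(boxes))  # [(0, 2), (1, 2)]
```

In this toy example the edge between the two near-duplicate boxes and every edge to the distant box are dropped, leaving only the connections to the nearby region. A fuller treatment would also use the other factors named in the abstract (spatial distance, geometric dimension, overlap area) and would handle object-object, token-token, and object-token pairs separately.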