Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

Text-based Visual Question Answering (TextVQA) aims to answer questions about images that contain multiple scene texts. In most cases, the texts are naturally attached to the surfaces of objects, so spatial reasoning between texts and objects is crucial in TextVQA. However, existing approaches are confined to the 2D spatial information learned from the input images and rely on transformer-based architectures to reason implicitly during the fusion process. Under this setting, these 2D spatial reasoning approaches cannot distinguish fine-grained spatial relations between visual objects and scene texts that lie on the same image plane, which impairs both the interpretability and the performance of TextVQA models. In this paper, we introduce 3D geometric information into a human-like spatial reasoning process to capture the contextual knowledge of key objects step by step. Specifically, (i) we propose a relation prediction module that accurately locates the regions of interest of critical objects; (ii) we design a depth-aware attention calibration module that calibrates the attention over OCR tokens according to the critical objects. Extensive experiments show that our method achieves state-of-the-art performance on the TextVQA and ST-VQA datasets. More encouragingly, our model surpasses others by clear margins of 5.7% and 12.1% on questions that involve spatial reasoning in the TextVQA and ST-VQA validation splits. In addition, we verify the generalizability of our model on the text-based image captioning task.
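
To make the idea of depth-aware attention calibration concrete, the following is a minimal sketch and not the module described in the paper, whose exact formulation is not given in this abstract. It assumes that each OCR token and the key object have a scalar depth estimate, and that attention is reweighted by an exponential depth-proximity prior and then renormalized. The function name `calibrate_attention`, the `sharpness` parameter, and the exponential prior are all hypothetical choices made for illustration.

```python
import math


def calibrate_attention(attention, ocr_depths, object_depth, sharpness=1.0):
    """Reweight OCR-token attention by depth proximity to a key object.

    Illustrative assumption only: the paper's actual calibration rule is not
    specified in the abstract. Here each token's attention is multiplied by
    exp(-sharpness * |token_depth - object_depth|) and the result is
    renormalized to sum to 1.
    """
    if len(attention) != len(ocr_depths):
        raise ValueError("attention and ocr_depths must have the same length")

    # Hypothetical depth prior: tokens at a depth close to the object's depth
    # receive a weight near 1, distant tokens a weight near 0.
    priors = [math.exp(-sharpness * abs(d - object_depth)) for d in ocr_depths]

    scaled = [a * p for a, p in zip(attention, priors)]
    total = sum(scaled)
    if total == 0.0:
        # Nothing to renormalize; fall back to the uncalibrated attention.
        return list(attention)
    return [s / total for s in scaled]


# Two OCR tokens with equal 2D attention; only the first lies at roughly the
# same depth as the key object, so it should dominate after calibration.
attention = [0.5, 0.5]
ocr_depths = [0.2, 0.9]
calibrated = calibrate_attention(attention, ocr_depths, object_depth=0.25, sharpness=5.0)
```

In this toy case the two tokens are indistinguishable from 2D attention alone, and the depth prior shifts the weight toward the token that sits at a similar depth to the object, which is the kind of disambiguation the abstract attributes to 3D information.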
