Separate and Locate: Rethink the Text in Text-based Visual Question Answering

Text-based Visual Question Answering (TextVQA) aims to answer questions about the text in images. Most work in this field focuses on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language ``sentence''. However, they ignore the fact that most OCR words in the TextVQA task have no semantic contextual relationship with one another. In addition, these approaches use a 1-D position embedding to model the spatial relations between OCR tokens sequentially, which is not reasonable: a 1-D position embedding can represent only the left-to-right order of words in a sentence, not complex spatial position relationships. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs a spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. We then introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets, respectively. Compared with the state-of-the-art method pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves accuracy improvements of 2.68% and 2.52% on TextVQA and ST-VQA, respectively. Our code and models will be released at https://github.com/fangbufang/SaL.
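
The limitation of 1-D positions described above can be illustrated with a toy example. The sketch below is not the SCP module or any part of SaL; the token list, pixel coordinates, the `row_height` parameter, and the helper names are all hypothetical and chosen only for illustration. It sorts three OCR tokens in reading order and compares their 1-D index gaps with their 2-D pixel distances.

```python
from math import hypot

# Hypothetical OCR tokens: (text, x_center, y_center) in pixels.
TOKENS = [
    ("AHEAD", 50, 90),
    ("STOP", 50, 40),
    ("EXIT", 450, 40),
]


def reading_order(tokens, row_height=50):
    """Sort top to bottom, then left to right.

    Tokens whose y_center falls in the same row_height band share a row
    (a simplifying assumption for this sketch).
    """
    return sorted(tokens, key=lambda t: (t[2] // row_height, t[1]))


ordered = reading_order(TOKENS)
order = [text for text, _, _ in ordered]

# 1-D position: the index of each token in the reading-order sequence.
index_of = {text: i for i, (text, _, _) in enumerate(ordered)}
# 2-D position: the pixel center of each token.
center_of = {text: (x, y) for text, x, y in ordered}


def index_gap(a, b):
    """Distance between two tokens as seen by a 1-D position index."""
    return abs(index_of[a] - index_of[b])


def pixel_dist(a, b):
    """Euclidean distance between two token centers in the image."""
    (xa, ya), (xb, yb) = center_of[a], center_of[b]
    return hypot(xa - xb, ya - yb)


if __name__ == "__main__":
    print(order)
    for other in ("EXIT", "AHEAD"):
        print("STOP ->", other, index_gap("STOP", other), pixel_dist("STOP", other))
```

In this toy layout the 1-D index places "EXIT" next to "STOP", even though "AHEAD" sits directly below "STOP" and is much closer in the image. This is the kind of mismatch that motivates a position embedding built from spatial relations rather than sequence order.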
