Text-based Visual Question Answering (TextVQA) aims to answer questions about the text in images. Most prior work in this field focuses on designing network structures or pre-training tasks. All of these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is then treated as a natural language ``sentence''. However, they ignore the fact that most OCR words in the TextVQA task have no semantic contextual relationship with one another. In addition, these approaches use a 1-D position embedding to model the spatial relations between OCR tokens sequentially, which is not reasonable: a 1-D position embedding can represent only the left-to-right order of words in a sentence, not the complex spatial relationships between texts laid out in an image. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs a spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. We then introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets, respectively. Compared with the state-of-the-art pre-training method, which is pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves accuracy improvements of 2.68% and 2.52% on TextVQA and ST-VQA, respectively. Our code and models will be released at https://github.com/fangbufang/SaL.
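To make the limitation of reading-order sequences concrete, the following is a minimal sketch, not the SaL implementation and not the SCP module. The names `OcrToken`, `reading_order`, `index_gap`, `spatial_gap`, the `line_tol` threshold, and the toy two-row layout are all hypothetical and chosen only for illustration. It flattens OCR tokens into reading order and compares the distance between two tokens in the 1-D sequence with their distance in the image.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class OcrToken:
    """Hypothetical OCR token: text plus box center, normalized to [0, 1]."""
    text: str
    cx: float
    cy: float


def reading_order(tokens, line_tol=0.05):
    """Flatten tokens top-to-bottom, then left-to-right.

    Tokens whose vertical centers lie within `line_tol` of the first token
    of the current line are grouped into that line (an assumed heuristic).
    """
    lines, current = [], []
    for tok in sorted(tokens, key=lambda t: t.cy):
        if current and abs(tok.cy - current[0].cy) > line_tol:
            lines.append(current)
            current = []
        current.append(tok)
    if current:
        lines.append(current)
    return [t for line in lines for t in sorted(line, key=lambda t: t.cx)]


def index_gap(seq, a, b):
    """Distance between two tokens as seen by a 1-D position index."""
    return abs(seq.index(a) - seq.index(b))


def spatial_gap(a, b):
    """Euclidean distance between two token centers in the image plane."""
    return hypot(a.cx - b.cx, a.cy - b.cy)


# Toy layout: two rows of three unrelated words, as on a signboard.
exit_tok = OcrToken("EXIT", 0.1, 0.1)
cafe_tok = OcrToken("CAFE", 0.5, 0.1)
twelve_tok = OcrToken("12", 0.9, 0.1)
north_tok = OcrToken("NORTH", 0.1, 0.2)
open_tok = OcrToken("OPEN", 0.5, 0.2)
a_tok = OcrToken("A", 0.9, 0.2)

seq = reading_order([open_tok, a_tok, exit_tok, north_tok, twelve_tok, cafe_tok])

# "EXIT" sits directly above "NORTH", yet they are 3 positions apart in 1-D.
stacked_index_gap = index_gap(seq, exit_tok, north_tok)
# "12" and "NORTH" are on opposite sides of the image, yet adjacent in 1-D.
far_index_gap = index_gap(seq, twelve_tok, north_tok)

print([t.text for t in seq])
print(stacked_index_gap, spatial_gap(exit_tok, north_tok))
print(far_index_gap, spatial_gap(twelve_tok, north_tok))
```

In this toy layout the 1-D index places the two spatially distant tokens next to each other and the two vertically stacked tokens farther apart, which is the kind of mismatch a 2-D spatial position encoding is meant to avoid.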