Vision-and-language (V&L) models often lack the ability to recognize and reason about text embedded in visual inputs, perhaps because V&L pre-training methods have often failed to include this ability in their training objectives. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text in an image and connect it to the rest of the image content. We implement PreSTU using a simple transformer-based encoder-decoder architecture, combined with large-scale image-text datasets whose scene text is obtained from an off-the-shelf OCR system. We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
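To make the idea of an OCR-aware objective more concrete, the following is a minimal sketch of how one training example might be built from OCR output. The abstract does not specify the objective's exact form, so everything here is an assumption made for illustration: the split-and-predict formulation (the encoder sees the image plus a prefix of the OCR text, and the decoder is trained to generate the remainder), the `OcrToken` fields, and the helper names `reading_order` and `make_split_example`. Image encoding and the model itself are omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrToken:
    """One word returned by an OCR system (hypothetical schema)."""
    text: str
    x: float  # left edge of the bounding box, in pixels
    y: float  # top edge of the bounding box, in pixels


def reading_order(tokens: List[OcrToken]) -> List[OcrToken]:
    """Approximate reading order: top-to-bottom, then left-to-right."""
    return sorted(tokens, key=lambda t: (t.y, t.x))


def make_split_example(
    tokens: List[OcrToken], rng: random.Random
) -> Tuple[str, str]:
    """Split the ordered OCR text at a random point.

    Returns (prefix, target). Under the assumed objective, the prefix
    would be fed to the encoder together with the image, and the target
    would be the text the decoder learns to generate.
    """
    ordered = [t.text for t in reading_order(tokens)]
    if not ordered:
        # No scene text detected: nothing to condition on or predict.
        return "", ""
    # Choose a split in [0, len - 1] so the target is never empty.
    split = rng.randrange(len(ordered))
    prefix = " ".join(ordered[:split])
    target = " ".join(ordered[split:])
    return prefix, target


if __name__ == "__main__":
    sign = [
        OcrToken("OFF", x=10, y=40),
        OcrToken("SALE", x=10, y=5),
        OcrToken("50%", x=60, y=5),
    ]
    prefix, target = make_split_example(sign, random.Random(0))
    print("encoder text input:", repr(prefix))
    print("decoder target:    ", repr(target))
```

Because the split point is random, the model would sometimes have to produce all of the scene text from the image alone (empty prefix) and sometimes only complete it, which is one plausible way to tie text recognition to the surrounding visual context.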