Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning over the text in the image. Despite their obvious resemblance, the general and scene-text versions are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.