Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first method involves utilizing external Optical Character Recognition (OCR) tools to extract textual information from images, which is then prepended to other textual inputs. The second strategy focuses on employing extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first strategy by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module that receives OCR output together with layout information and compresses it into a short fixed-length sequence for input into the LLM. Initially, we conduct model-agnostic pretraining of the OCR module on unlabeled documents, followed by its integration into any VL architecture through brief fine-tuning. Extensive experiments demonstrate consistent performance improvements when applying TAP-VL to top-performing VL models, across scene-text and document-based VL benchmarks.
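To make the described architecture concrete, the sketch below illustrates one plausible way such an OCR compression module could look: a small set of learned query vectors cross-attends to OCR token embeddings (text plus bounding-box layout) and produces a fixed-length sequence projected into the LLM's embedding space. This is a minimal, hedged illustration assuming a Q-Former-style resampler design; all class names, dimensions, and hyperparameters (e.g. `OCRCompressionModule`, `num_queries=32`) are hypothetical and are not taken from the paper's actual implementation.

```python
import torch
import torch.nn as nn

class OCRCompressionModule(nn.Module):
    """Illustrative sketch (not the paper's code): compress OCR tokens with
    layout into a short fixed-length sequence of LLM-space embeddings."""

    def __init__(self, vocab_size=30522, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=4, num_heads=8):
        super().__init__()
        # Embed OCR word-piece tokens and their 2D layout (bounding boxes).
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        self.layout_proj = nn.Linear(4, hidden_dim)  # (x0, y0, x1, y1) per token
        # A fixed number of learned queries yields a fixed-length output.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        # Lightweight transformer: queries cross-attend to the OCR tokens.
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Project the compressed sequence into the LLM's embedding space.
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, ocr_token_ids, ocr_boxes, ocr_mask=None):
        # ocr_token_ids: (B, N) token ids; ocr_boxes: (B, N, 4) normalized boxes.
        ocr_feats = self.token_embed(ocr_token_ids) + self.layout_proj(ocr_boxes)
        queries = self.queries.unsqueeze(0).expand(ocr_feats.size(0), -1, -1)
        pad_mask = ~ocr_mask.bool() if ocr_mask is not None else None
        compressed = self.decoder(queries, ocr_feats,
                                  memory_key_padding_mask=pad_mask)
        # (B, num_queries, llm_dim): a short fixed-length sequence that can be
        # prepended to the LLM's other input embeddings.
        return self.to_llm(compressed)
```

Under this reading, the module stays frozen-architecture-agnostic: only its output embeddings touch the LLM input, which is consistent with the abstract's claim that it can be pretrained once on unlabeled documents and then attached to different VL models with brief fine-tuning.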