In recent years, notable advancements have been made in visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component is either extracted explicitly using external OCR models in OCR-based approaches, or, in OCR-free approaches, the vision model is endowed with reading capabilities. Typically, queries to the model are input exclusively to the language component, requiring the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt, allowing the model to highlight relevant parts of the document while disregarding others. We pair these architectural enhancements with a novel pre-training task that applies language masking to a snippet of the document text fed to the visual encoder in place of the prompt, endowing the model with focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on multiple benchmarks.
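To make the prompt-guided down-sampling concrete, the following is a minimal PyTorch sketch of one such layer. It is an illustration under assumptions, not the paper's implementation: the class name `PromptGuidedDownsample`, the use of multi-head cross-attention between visual patch tokens and prompt token embeddings, and the average-pooling sequence reduction are all our own choices; the abstract only specifies that the down-sampling layers receive the input prompt and emphasize relevant parts of the document.

```python
import torch
import torch.nn as nn


class PromptGuidedDownsample(nn.Module):
    """Hypothetical prompt-aware down-sampling layer (illustrative sketch).

    Visual patch tokens cross-attend to the prompt token embeddings so that
    query-relevant patches are emphasized before the sequence is shortened.
    """

    def __init__(self, dim: int, prompt_dim: int, num_heads: int = 8, pool: int = 2):
        super().__init__()
        self.proj = nn.Linear(prompt_dim, dim)  # map prompt embeddings to the visual width
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.pool = nn.AvgPool1d(pool)  # sequence-length reduction (the down-sampling step)

    def forward(self, patches: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, dim) visual tokens; prompt: (B, T, prompt_dim) token embeddings
        p = self.proj(prompt)
        attended, _ = self.cross_attn(query=patches, key=p, value=p)
        patches = self.norm(patches + attended)  # prompt-conditioned refinement of patch tokens
        # Down-sample along the token dimension: (B, N, dim) -> (B, N // pool, dim)
        return self.pool(patches.transpose(1, 2)).transpose(1, 2)
```

In the same spirit, the pre-training task described above could be sketched by feeding a masked snippet of the document's own text through this layer in place of the prompt and training the model to predict the masked tokens, which encourages the visual features to localize the text referenced by the conditioning input.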