The increasing use of transformer-based large language models raises the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, even though documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding and enhance it with dedicated document-level layers and learnable tokens, which facilitate the flow of information across pages for global reasoning. To encourage the model to use the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression transformer (C-Former), which reduces the encoded sequence length and thereby allows a trade-off between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.
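The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`page_layer`, `doc_layer`, `encode`, `compress`), the single-head attention without learned projections, and the constant `doc_bias` added to the attention logits are all assumptions made for illustration. In particular, `doc_bias` is only a hypothetical stand-in for the bias adaptation method, whose actual form is not specified here, and `compress` is a generic query-based cross-attention standing in for the C-Former.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, bias=None):
    """Single-head scaled dot-product attention; `bias` is added to the logits."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if bias is not None:
        logits = logits + bias
    return softmax(logits) @ v

def page_layer(page, n_doc, doc_bias):
    """Page-level self-attention over [doc tokens; page tokens] of ONE page."""
    n = page.shape[0]
    bias = np.zeros((n, n))
    # Assumption: a constant positive bias toward the doc-token keys, as a
    # hypothetical stand-in for the bias adaptation mentioned in the text.
    bias[:, :n_doc] = doc_bias
    return page + attention(page, page, page, bias)

def doc_layer(pages, n_doc):
    """Document-level self-attention over the doc tokens of ALL pages."""
    doc = np.concatenate([p[:n_doc] for p in pages])  # (num_pages * n_doc, d)
    doc = doc + attention(doc, doc, doc)
    # Write the updated doc tokens back in front of each page's own tokens.
    return [
        np.concatenate([doc[i * n_doc:(i + 1) * n_doc], p[n_doc:]])
        for i, p in enumerate(pages)
    ]

def encode(pages, doc_tokens, n_layers=2, doc_bias=1.0):
    """Interleave page-level and document-level layers; return one sequence."""
    n_doc = doc_tokens.shape[0]
    pages = [np.concatenate([doc_tokens, p]) for p in pages]
    for _ in range(n_layers):
        pages = [page_layer(p, n_doc, doc_bias) for p in pages]  # local
        pages = doc_layer(pages, n_doc)                          # global
    return np.concatenate(pages)

def compress(encoded, queries):
    """Optional compression: a fixed set of queries cross-attends to the
    encoded sequence, so the output length equals the number of queries."""
    return attention(queries, encoded, encoded)
```

In this sketch, page tokens never attend directly to other pages; cross-page information reaches them only through the document tokens, which is why at least two interleaved layers are needed before a page's own tokens reflect the content of another page. The number of rows in `queries` sets the length of the sequence handed to the decoder, which is where the quality versus latency trade-off would be controlled.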