The increasing use of transformer-based large language models raises the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, although real documents can span hundreds of pages. We present GRAM, a method that extends pre-trained single-page models to the multi-page setting without requiring computationally heavy pretraining. To do so, we use a single-page encoder for local, page-level understanding and augment it with dedicated document-level layers and learnable tokens, which let information flow across pages for global reasoning. To encourage the model to use the newly introduced document-level tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage based on our C-Former model, which reduces the encoded sequence length and thereby allows a tradeoff between quality and latency. Extensive experiments show that GRAM achieves state-of-the-art performance on multi-page DocVQA benchmarks, demonstrating the effectiveness of our approach.
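The interleaving of page-level and document-level processing described above can be illustrated with a minimal sketch. It is not the GRAM implementation: the function names (`attention`, `gram_like_block`, `make_tokens`), the identity projections, the single attention head, and the absence of residual connections and normalization are all simplifying assumptions made for illustration. The bias adaptation method and the C-Former compression stage are not modeled here.

```python
import math
import random


def attention(tokens):
    """Single-head self-attention with identity projections (illustrative only)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) / total
                    for d in range(dim)])
    return out


def gram_like_block(pages, doc_tokens):
    """One hypothetical block: a page-level step, then a document-level step.

    pages:      list of pages, each a list of token vectors
    doc_tokens: list (one entry per page) of document-token vectors
    """
    n_doc = len(doc_tokens[0])

    # Page-level step: each page attends only within itself,
    # with its own document tokens prepended to the page tokens.
    new_pages, new_doc = [], []
    for page, dts in zip(pages, doc_tokens):
        mixed = attention(dts + page)
        new_doc.append(mixed[:n_doc])
        new_pages.append(mixed[n_doc:])

    # Document-level step: only the document tokens of all pages
    # attend to each other, which carries information across pages.
    flat = attention([t for dts in new_doc for t in dts])
    new_doc = [flat[i * n_doc:(i + 1) * n_doc] for i in range(len(pages))]
    return new_pages, new_doc


def make_tokens(rng, n, dim):
    """Random token vectors standing in for real page embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
pages = [make_tokens(rng, 3, 4), make_tokens(rng, 5, 4)]   # two pages of unequal length
doc_tokens = [make_tokens(rng, 2, 4) for _ in pages]       # two document tokens per page
pages, doc_tokens = gram_like_block(pages, doc_tokens)
print([len(p) for p in pages], [len(d) for d in doc_tokens])
```

In this sketch the page tokens never attend directly to other pages, so the attention cost of the page-level step grows with the number of pages rather than with the square of the full document length. Content from another page reaches a page's own tokens only through the document tokens, and only from the second block onward.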