The pre-training and fine-tuning paradigm built on layout-aware multimodal pre-trained models has achieved significant progress on document image question answering. However, the domain pre-training and task fine-tuning required for their additional visual, layout, and task modules prevent these models from directly using off-the-shelf instruction-tuned language foundation models, which have recently shown promising zero-shot learning ability. Instead of aligning language models to the domain of document image question answering, we align document image question answering to off-the-shelf instruction-tuned language foundation models in order to exploit their zero-shot capability. Specifically, we propose a layout- and task-aware instruction prompt called LATIN-Prompt, which consists of layout-aware document content and a task-aware description. The former recovers the layout information among text segments produced by OCR tools using appropriate spaces and line breaks. The latter gives a detailed description of the task so that the model generates answers that meet the requirements, especially the format requirements. Experimental results on three benchmarks show that LATIN-Prompt can improve the zero-shot performance of instruction-tuned language foundation models on document image question answering and help them reach performance comparable to state-of-the-art methods based on the pre-training and fine-tuning paradigm. Quantitative and qualitative analyses demonstrate the effectiveness of LATIN-Prompt. We provide the code in the supplementary material and will release it to facilitate future research.
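The two parts of the prompt described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the OCR tool returns text segments with pixel bounding boxes. The `Segment` class, the `char_width` and `y_tolerance` parameters, the function names, and the wording of the task description are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One OCR text segment with its bounding box in pixels (hypothetical schema)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_rows(segments, y_tolerance=5.0):
    """Group segments whose top edge is within y_tolerance of the row's first segment."""
    rows = []
    for seg in sorted(segments, key=lambda s: (s.y0, s.x0)):
        if rows and abs(seg.y0 - rows[-1][0].y0) <= y_tolerance:
            rows[-1].append(seg)
        else:
            rows.append([seg])
    # Within each row, read segments from left to right.
    return [sorted(row, key=lambda s: s.x0) for row in rows]


def layout_aware_text(segments, char_width=10.0, y_tolerance=5.0):
    """Render segments as plain text, using spaces and line breaks to mimic the layout."""
    lines = []
    for row in group_into_rows(segments, y_tolerance):
        line = ""
        for seg in row:
            # Column where this segment should start, given an assumed character width.
            target_col = int(seg.x0 / char_width)
            # Pad up to that column; always keep at least one space between segments.
            pad = max(target_col - len(line), 1 if line else 0)
            line += " " * pad + seg.text
        lines.append(line)
    return "\n".join(lines)


def build_prompt(segments, question, task_description):
    """Combine a task-aware description with the layout-aware document content."""
    document = layout_aware_text(segments)
    return (
        f"{task_description}\n\n"
        f"Document:\n{document}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    example = [
        Segment("Name", 0, 0, 40, 10),
        Segment("Total", 200, 2, 250, 12),
        Segment("Alice", 0, 20, 50, 30),
        Segment("42", 200, 21, 220, 31),
    ]
    # Illustrative task description only; the paper's actual wording is not reproduced here.
    task = (
        "Answer the question using only text that appears in the document. "
        "Reply with the answer span alone, without explanation."
    )
    print(build_prompt(example, "What is Alice's total?", task))
```

Mapping each segment's horizontal position to a character column keeps table-like content aligned in plain text, so a text-only model can still see which value sits under which header. A real pipeline would likely need to derive the character width from the OCR output rather than fix it.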