Document Visual Question Answering (DVQA) is the task of answering questions about the content of document images. Existing work is limited to locating information within a single page and does not support question answering across pages. In addition, the token length limit on model inputs can truncate segments that are relevant to the answer. In this study, we introduce CFRet-DVQA, a simple but effective method that focuses on retrieval and efficient tuning to address these limitations. We first retrieve multiple segments from the document that are relevant to the question. We then leverage the reasoning abilities of a large language model (LLM) and further improve its performance through instruction tuning. This approach enables the model to generate answers that match the style of the document labels. Experiments show that our method achieves state-of-the-art or competitive results on both single-page and multi-page documents across various domains.
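The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how CFRet-DVQA segments documents or scores relevance, so everything here is an assumption made for illustration: the function names (`split_into_segments`, `retrieve`, `build_prompt`), the fixed word-window segmentation, and the bag-of-words cosine similarity standing in for whatever retriever the method actually uses. The point is only to show how selecting a few question-relevant segments from all pages keeps the LLM input short while still allowing cross-page answers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_segments(pages, max_words=50):
    """Cut each page's text into word windows, keeping the source page index.

    Hypothetical segmentation: fixed-size windows, no overlap.
    """
    segments = []
    for page_idx, page_text in enumerate(pages):
        words = page_text.split()
        for start in range(0, len(words), max_words):
            segments.append((page_idx, " ".join(words[start:start + max_words])))
    return segments


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, pages, top_k=3, max_words=50):
    """Return the top_k (score, page_idx, segment) tuples for the question.

    Assumed scoring: bag-of-words cosine similarity, a stand-in for a
    learned retriever.
    """
    q_vec = Counter(tokenize(question))
    scored = [
        (cosine(q_vec, Counter(tokenize(seg))), page_idx, seg)
        for page_idx, seg in split_into_segments(pages, max_words)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, retrieved):
    """Assemble an LLM prompt from the retrieved segments (illustrative format)."""
    context = "\n".join(f"[page {p + 1}] {s}" for _, p, s in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


pages = [
    "The invoice total is 420 dollars due in March.",
    "Shipping address: 12 Elm Street, Springfield.",
    "Contact the billing department for the invoice total questions.",
]
question = "What is the invoice total?"
top = retrieve(question, pages, top_k=2)
prompt = build_prompt(question, top)
```

In this toy example, only the two segments that share vocabulary with the question are passed on, so the unrelated shipping page never consumes input tokens. A real system would likely replace the word-count similarity with dense embeddings and pass `prompt` to an instruction-tuned LLM.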