Document question answering is a task of question answering on given documents such as reports, slides, pamphlets, and websites, and it is a truly demanding task as paper and electronic forms of documents are so common in our society. This is known as a quite challenging task because it requires not only text understanding but also understanding of figures and tables, and hence visual question answering (VQA) methods are often examined in addition to textual approaches. We introduce Japanese Document Question Answering (JDocQA), a large-scale document-based QA dataset, essentially requiring both visual and textual information to answer questions, which comprises 5,504 documents in PDF format and annotated 11,600 question-and-answer instances in Japanese. Each QA instance includes references to the document pages and bounding boxes for the answer clues. We incorporate multiple categories of questions and unanswerable questions from the document for realistic question-answering applications. We empirically evaluate the effectiveness of our dataset with text-based large language models (LLMs) and multimodal models. Incorporating unanswerable questions in finetuning may contribute to harnessing the so-called hallucination generation.
翻译:文档问答是一项针对给定文档(如报告、幻灯片、宣传册和网站)进行提问与回答的任务。由于纸质和电子文档在社会中极为普遍,这是一项具有实际应用价值的任务。该任务被认为颇具挑战性,因为它不仅需要理解文本,还需要理解图表内容,因此除了文本方法外,视觉问答(VQA)方法也常被用于研究。本文提出了日语文档问答(JDocQA)——一个大规模基于文档的问答数据集,该数据集本质上要求同时利用视觉和文本信息来回答问题,包含5,504份PDF格式文档以及11,600个标注的日语问答实例。每个问答实例均标注了答案线索所在的文档页面和边界框。为贴近真实问答应用场景,我们融入了多种问题类别以及文档中不可回答的问题。我们通过基于文本的大型语言模型(LLM)和多模态模型对该数据集的有效性进行了实证评估。在微调中引入不可回答问题可能有助于抑制所谓的幻觉生成现象。