Document Visual Question Answering (VQA) aims to answer natural-language questions about visually rich documents, and is an emerging research topic in both Natural Language Processing and Computer Vision. In this work, we introduce TAT-DQA, a new Document VQA dataset built by extending the TAT-QA dataset. It consists of 3,067 document pages that contain semi-structured table(s) as well as unstructured text, together with 16,558 question-answer pairs. The documents are sampled from real-world financial reports and contain many numbers, so answering questions on this dataset requires discrete reasoning. Based on TAT-DQA, we further develop a novel model named MHST that draws on information from multiple modalities, including text, layout and visual image, and addresses different types of questions with corresponding strategies, i.e., extraction or reasoning. Extensive experiments show that MHST significantly outperforms the baseline methods, demonstrating its effectiveness. However, its performance still lags far behind that of expert humans. We expect that TAT-DQA will facilitate research on deep understanding of visually rich documents that combines vision and language, especially in scenarios that require discrete reasoning. We also hope the proposed model will inspire researchers to design more advanced Document VQA models in the future. Our dataset will be publicly available for non-commercial use at https://nextplusplus.github.io/TAT-DQA/.