The widespread use of digital documents, driven by the strong trend toward paperless initiatives, has forced some companies to find ways to process thousands of documents per day automatically. To achieve this, they use automatic information retrieval (IR), which allows them to extract useful information from large datasets quickly. Effective IR methods first require an adequate dataset. Although companies have enough data to cover their own needs, a public dataset is also needed to compare state-of-the-art methods. Public document datasets exist, such as DocVQA [2] and XFUND [10], but they do not fully satisfy the needs of companies. XFUND contains only forms, whereas companies use several types of documents (i.e., structured documents such as forms, semi-structured documents such as invoices, and unstructured documents such as emails). Unlike XFUND, DocVQA covers several types of documents, but only 4.5% of them are corporate documents (e.g., invoices and purchase orders), and this 4.5% does not cover the diversity of documents that companies require. We propose CHIC, a public visual question-answering dataset. It contains different types of corporate documents, and the information extracted from these documents meets the expectations of companies.