Document image classification differs from plain-text document classification: it requires classifying a document by understanding both the content and the structure of documents such as forms and emails. We show that the only existing dataset for this task (Lewis et al., 2006) has several limitations, and we introduce two newly curated multilingual datasets, WIKI-DOC and MULTIEURLEX-DOC, that overcome them. We further conduct a comprehensive study of popular visually-rich document understanding (Document AI) models in previously untested settings for document image classification, such as 1) multi-label classification and 2) zero-shot cross-lingual transfer. Experimental results reveal the limitations of multilingual Document AI models in cross-lingual transfer across typologically distant languages. Our datasets and findings open the door for future research into improving Document AI models.