Numerous pre-training techniques for visual document understanding (VDU) have recently shown substantial performance improvements across a wide range of document tasks. However, these pre-trained VDU models are not guaranteed to keep performing well when the test distribution differs from the training distribution. In this paper, to investigate how robust existing pre-trained VDU models are to various distribution shifts, we first develop an out-of-distribution (OOD) benchmark, termed Do-GOOD, for fine-Grained analysis of Document image-related tasks. The Do-GOOD benchmark defines the underlying mechanisms that give rise to different distribution shifts and contains 9 OOD datasets covering 3 VDU-related tasks: document information extraction, classification, and question answering. We then evaluate the robustness of 5 of the latest pre-trained VDU models and 2 typical OOD generalization algorithms on these OOD datasets and analyze the results at a fine-grained level. The experiments show a significant performance gap between the in-distribution (ID) and OOD settings for document images, and the fine-grained analysis of distribution shifts reveals how brittle existing pre-trained VDU models and OOD generalization algorithms are. The code and datasets for our Do-GOOD benchmark can be found at https://github.com/MAEHCM/Do-GOOD.