Recently developed pre-trained text-and-layout models (PTLMs) have shown remarkable success in multiple information extraction tasks on visually-rich documents (VrDs). However, despite achieving extremely high scores on benchmarks, their real-world performance falls short of expectations. Motivated by this gap, we investigate the prevailing evaluation pipeline and reveal that: (1) inadequate annotations in benchmark datasets introduce spurious correlations between task inputs and labels, which lead to overly optimistic estimates of model performance; (2) evaluation that relies solely on benchmark performance is insufficient to comprehensively explore the capabilities of methods in real-world scenarios. These problems prevent the prevailing evaluation pipeline from reflecting the real-world performance of methods and mislead design choices in method optimization. In this work, we introduce EC-FUNSD, an entity-centric dataset crafted for benchmarking information extraction from visually-rich documents. The dataset contains diverse layouts and high-quality annotations, and it disentangles the falsely coupled segment and entity annotations that arise from the block-level annotation of FUNSD. Using the proposed dataset, we evaluate the real-world information extraction capabilities of PTLMs from multiple aspects, including their absolute performance as well as their generalization, robustness, and fairness. The results indicate that prevalent PTLMs do not perform as well as anticipated in real-world information extraction scenarios. We hope that our study can inspire reflection on the directions of PTLM development.