Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets in which several documents are stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, i.e., separating a packet into its constituent documents, remains largely unaddressed. We present DocSplit, the first comprehensive benchmark for this task, together with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and preserve correct page ordering within a packet. The benchmark targets real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. Extensive experiments evaluating multimodal LLMs on our datasets reveal significant performance gaps in current models' ability to handle complex document splitting. Together, the DocSplit datasets and metrics provide a systematic framework for advancing document understanding capabilities essential to legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.
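To make the task concrete: a packet split can be represented as a list of documents, each an ordered list of page indices. The abstract does not define the proposed metrics, so the `segment_f1` score below is a hypothetical segment-level exact-match F1, a minimal sketch of how such a split might be scored rather than the paper's actual metric.

```python
def segment_f1(gold, pred):
    """Hypothetical segment-level F1: a predicted document counts as a
    true positive only if its ordered page sequence exactly matches a
    gold document (handles out-of-order and non-contiguous pages)."""
    g = {tuple(seg) for seg in gold}
    p = {tuple(seg) for seg in pred}
    tp = len(g & p)
    if not g and not p:
        return 1.0
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# A 5-page packet: gold split has three documents; the model wrongly
# merges page 3 into the second document, so only [0, 1] matches.
gold = [[0, 1], [2], [3, 4]]
pred = [[0, 1], [2, 3], [4]]
print(round(segment_f1(gold, pred), 3))  # -> 0.333
```

An exact-match criterion like this is deliberately strict; softer variants (e.g., page-level boundary agreement) would credit partial overlaps.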