Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets in which several documents are stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, i.e., separating a packet into its constituent documents, remains largely unaddressed. We present DocSplit, the first comprehensive benchmark for this task, together with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and preserve correct page ordering within a packet. The benchmark targets real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. Extensive experiments evaluating multimodal LLMs on our datasets reveal significant performance gaps in current models' ability to handle complex document splitting. Together, the DocSplit datasets and metrics provide a systematic framework for advancing document understanding capabilities essential to legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.
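To make the task concrete: a packet split can be represented as a list of documents, each an ordered list of page indices. The abstract does not define the proposed metrics, so the `segment_f1` score below is a hypothetical segment-level exact-match F1, a minimal sketch of how such a split might be scored rather than the paper's actual metric.

```python
def segment_f1(gold, pred):
    """Hypothetical segment-level F1: a predicted document counts as a
    true positive only if its ordered page sequence exactly matches a
    gold document (handles out-of-order and non-contiguous pages)."""
    g = {tuple(seg) for seg in gold}
    p = {tuple(seg) for seg in pred}
    tp = len(g & p)
    if not g and not p:
        return 1.0
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# A 5-page packet: gold split has three documents; the model wrongly
# merges page 3 into the second document, so only [0, 1] matches.
gold = [[0, 1], [2], [3, 4]]
pred = [[0, 1], [2, 3], [4]]
print(round(segment_f1(gold, pred), 3))  # -> 0.333
```

An exact-match criterion like this is deliberately strict; softer variants (e.g., page-level boundary agreement) would credit partial overlaps.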