Page Stream Segmentation (PSS) is an essential prerequisite for automated document processing at scale. However, research progress has been limited by the absence of realistic public benchmarks. This paper works toward closing this gap by introducing TABME++, an enhanced benchmark featuring commercial Optical Character Recognition (OCR) annotations. We evaluate large language models (LLMs) on PSS, focusing on decoder-based models fine-tuned with parameter-efficient methods. Our results show that decoder-based LLMs outperform smaller multimodal encoders. Through a review of existing PSS research and datasets, we identify key challenges and advancements in the field. Our findings highlight the importance of robust OCR and offer insights for the development of more effective document processing systems.