Handwritten text recognition (HTR) remains a challenging task, particularly for multi-page documents where pages share common formatting and contextual features. While modern optical character recognition (OCR) engines are proficient with printed text, their performance on handwriting is limited, often requiring costly labeled data for fine-tuning. In this paper, we explore the use of multi-modal large language models (MLLMs) for transcribing multi-page handwritten documents in a zero-shot setting. We investigate various configurations of commercial OCR engines and MLLMs, utilizing the latter both as end-to-end transcribers and as post-processors, with and without image components. We propose a novel method, '+first page', which enhances MLLM transcription by providing the OCR output of the entire document along with just the first page image. This approach leverages shared document features without incurring the high cost of processing all images. Experiments on a multi-page version of the IAM Handwriting Database demonstrate that '+first page' improves transcription accuracy, balances cost with performance, and even enhances results on out-of-sample text by extrapolating formatting and OCR error patterns from a single page.