Extracting fine-grained OCR text from aged documents in diacritic languages remains challenging due to unexpected artifacts, time-induced degradation, and a lack of datasets. While standalone spell-correction approaches have been proposed, they show limited performance on historical documents because of the large space of possible OCR error combinations and the distributional gap between modern and classical corpora. We propose a method that uses available content-focused ebooks as a reference base to correct imperfect OCR-generated text, supported by large language models. This technique generates high-precision pseudo page-to-page labels for diacritic languages, whose small strokes pose significant challenges under historical conditions. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, missing words, and disordered sequences. Our post-processing method, which produced a large OCR dataset of classical Vietnamese books, achieved a mean grading score of 8.72 on a 10-point scale, outperforming the state-of-the-art transformer-based Vietnamese spell-correction model, which scored 7.03 on a sampled subset of the dataset. We also trained a baseline OCR model and compared it with well-known engines. Experimental results demonstrate the strength of our baseline model over widely used open-source solutions. The resulting dataset will be released publicly to support future studies.
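The core idea of correcting noisy OCR output against a trusted reference text can be illustrated with a minimal sketch. This is not the paper's pipeline (which is LLM-supported); it is a simplified stand-in using character-level fuzzy matching, with hypothetical reference sentences, to show how a content-focused ebook could serve as a correction source for diacritic-damaged OCR lines:

```python
import difflib

# Hypothetical reference lines, as might be extracted from a
# content-focused ebook of a classical Vietnamese text.
reference = [
    "Truyện Kiều là một kiệt tác của văn học Việt Nam.",
    "Nguyễn Du sáng tác Truyện Kiều vào đầu thế kỷ XIX.",
]

def correct_with_reference(ocr_line, reference_lines, cutoff=0.6):
    """Replace an OCR line with its closest reference line when the
    character-level similarity clears the cutoff; otherwise keep the
    OCR line unchanged. A crude stand-in for the paper's matching step."""
    matches = difflib.get_close_matches(
        ocr_line, reference_lines, n=1, cutoff=cutoff
    )
    return matches[0] if matches else ocr_line

# An OCR line where the diacritics were lost during recognition.
noisy = "Truyen Kieu la mot kiet tac cua van hoc Viet Nam."
print(correct_with_reference(noisy, reference))
```

Because most characters still agree even when every diacritic is dropped, the similarity ratio to the correct reference line stays well above the cutoff, so the damaged line is swapped for its clean counterpart; a line with no sufficiently close reference is left untouched.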