Over the past few decades, large archives of paper-based documents such as books and newspapers have been digitized using Optical Character Recognition (OCR). This technology is error-prone, especially for historical documents. To correct OCR errors, post-processing algorithms have been proposed based on natural language analysis and machine learning techniques such as neural networks. A disadvantage of neural networks is the vast amount of manually labeled training data they require, which is often unavailable. This paper proposes an innovative method for training a lightweight neural network for Hebrew OCR post-correction using significantly less manually created data. The main research goal is to develop a method for automatically generating language- and task-specific training data that improves neural network results for OCR post-correction, and to investigate which type of dataset is most effective for OCR post-correction of historical documents. To this end, a series of experiments using several datasets was conducted. The evaluation corpus was based on Hebrew newspapers from the JPress project. Historical OCRed newspapers were analyzed to learn common language- and corpus-specific OCR errors. We found that training the network using the proposed method is more effective than using randomly generated errors. The results also show that the performance of the neural network for OCR post-correction strongly depends on the genre and domain of the training data. Moreover, neural networks trained with the proposed method outperform other state-of-the-art neural networks for OCR post-correction as well as complex spellcheckers. These results may have practical implications for many digital humanities projects.
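The contrast between corpus-informed and randomly generated training errors can be illustrated with a minimal sketch. It is not the paper's implementation: the confusion table, its weights, and the function names `corrupt_with_confusions`, `corrupt_randomly`, and `make_pairs` are all hypothetical, and the letter pairs are only plausible examples of visually similar Hebrew characters rather than the error statistics learned from the JPress corpus.

```python
import random

# Hypothetical confusion table: correct character -> [(OCR output, weight)].
# Illustrative guesses at visually similar Hebrew letters only; these are
# NOT the error statistics measured in the paper.
CONFUSIONS = {
    "ד": [("ר", 3), ("ך", 1)],
    "ה": [("ח", 2), ("ת", 1)],
    "ו": [("ז", 2), ("ן", 1)],
    "ב": [("כ", 1)],
    "ם": [("ס", 1)],
}

# Non-final Hebrew letters, used by the random baseline (assumption:
# final forms are left out to keep the sketch short).
HEBREW_ALPHABET = "אבגדהוזחטיכלמנסעפצקרשת"


def corrupt_with_confusions(text, error_rate, rng):
    """Substitute characters according to the weighted confusion table."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            candidates, weights = zip(*CONFUSIONS[ch])
            out.append(rng.choices(candidates, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


def corrupt_randomly(text, error_rate, rng):
    """Baseline: substitute letters with uniformly random letters.

    The random pick may happen to equal the original letter.
    """
    out = []
    for ch in text:
        if ch in HEBREW_ALPHABET and rng.random() < error_rate:
            out.append(rng.choice(HEBREW_ALPHABET))
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(sentences, corrupt, error_rate=0.1, seed=0):
    """Build (noisy, clean) training pairs from clean sentences."""
    rng = random.Random(seed)
    return [(corrupt(s, error_rate, rng), s) for s in sentences]
```

Calling `make_pairs(clean_sentences, corrupt_with_confusions)` and `make_pairs(clean_sentences, corrupt_randomly)` on the same clean text yields two synthetic datasets that differ only in how errors are chosen, which is the kind of comparison the abstract describes. Only substitutions are modeled here; real OCR output also contains insertions, deletions, and segmentation errors.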