Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.