The pressing need to digitize historical document collections has led to strong interest in computerized image processing methods for automatic handwritten text recognition (HTR). Handwritten text is highly variable because of differences in writing styles, languages and scripts. Because sufficient amounts of annotated multi-writer text are not available, training an accurate and robust HTR system calls for data-efficient approaches. This paper presents a case study from the ongoing project "Marginalia and Machine Learning", which focuses on the automatic detection and recognition of handwritten marginalia, i.e., text written in the margins or handwritten notes. A Faster R-CNN network is used to detect marginalia, and AttentionHTR is used for word recognition. The data comes from early printed book collections, held in the Uppsala University Library, that contain handwritten marginalia. Source code and pretrained models are available at https://github.com/ektavats/Project-Marginalia.
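The two-stage design described above (detect marginalia regions, then recognize the words in each region) can be illustrated with a minimal sketch. This is not the project's actual code: the function names, the `(x_min, y_min, x_max, y_max)` box format, the `min_score` threshold and the stub detector and recognizer are all assumptions made for illustration. In a real system, a trained Faster R-CNN model and AttentionHTR would take the place of the stubs.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a grayscale page as rows of pixel values, and a box as
# (x_min, y_min, x_max, y_max) in pixel coordinates with exclusive maxima.
Image = List[List[int]]
Box = Tuple[int, int, int, int]
Detector = Callable[[Image], List[Tuple[Box, float]]]
Recognizer = Callable[[Image], str]


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def read_marginalia(
    page: Image,
    detect: Detector,
    recognize: Recognizer,
    min_score: float = 0.5,  # assumed confidence threshold, not from the paper
) -> List[Tuple[Box, str]]:
    """Stage 1: detect candidate regions. Stage 2: recognize text in each one."""
    results = []
    for box, score in detect(page):
        if score < min_score:
            continue  # drop low-confidence detections
        results.append((box, recognize(crop(page, box))))
    return results


# Stand-ins for the trained models (assumptions, for demonstration only).
def stub_detect(page: Image) -> List[Tuple[Box, float]]:
    return [((4, 0, 6, 3), 0.9), ((0, 0, 1, 1), 0.2)]


def stub_recognize(region: Image) -> str:
    return f"<{len(region[0])}x{len(region)} crop>"


page = [[0] * 6 for _ in range(4)]  # blank page, 6 pixels wide and 4 high
results = read_marginalia(page, stub_detect, stub_recognize)
print(results)
```

Passing the detector and recognizer as plain callables keeps the two stages independent, so either model can be swapped or evaluated on its own.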