Scholars in the humanities rely heavily on ancient manuscripts to study the history, religion, and socio-political structures of the past. Considerable effort has gone into digitizing these precious manuscripts with Optical Character Recognition (OCR), but most manuscripts have been damaged over the centuries, and an OCR program cannot be expected to read faded characters or stained pages reliably. This work presents a neural spelling correction model, built on Google OCR-ed Tibetan manuscripts, that automatically corrects noisy OCR output. The paper is divided into four sections: dataset, model architecture, training, and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we incorporated a Confidence Score mechanism into the Transformer architecture to perform spelling correction. Measured by loss and Character Error Rate, our Transformer + Confidence Score architecture outperforms the Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens and visualized the Attention and Self-Attention heatmaps of the model.
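For readers unfamiliar with the Character Error Rate named above, the following is a minimal sketch of how the metric is commonly computed: the Levenshtein edit distance between the corrected output and the reference, divided by the reference length. The function names `edit_distance` and `character_error_rate` are illustrative and are not taken from the paper, and the sketch assumes each Unicode code point counts as one character. The paper may segment Tibetan text differently (for example by stacked glyph or by syllable), which would change the counts.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning one string
    into the other. Uses a rolling row to keep memory linear."""
    # prev[j] = distance between the empty prefix of reference
    # and the first j characters of hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # distance from reference[:i] to the empty string
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete ref_char
                curr[j - 1] + 1,                  # insert hyp_char
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference.
    Assumption: characters are Unicode code points, not grapheme
    clusters or Tibetan syllables."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a lower value is better, and the value can exceed 1.0 when the output contains many more characters than the reference. Comparing the rate before and after correction on the same reference text gives one way to quantify how much a correction model helps.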