Indonesia is one of the most linguistically diverse countries in the world. Despite this diversity, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been made to construct NLP resources for Indonesian languages. However, most of these efforts have focused on creating resources manually, which makes them difficult to scale to more languages. Although many Indonesian languages have no web presence, they are well documented locally in printed forms such as books, magazines, and newspapers. Digitizing these existing resources would allow Indonesian language resource construction to scale to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing printed documents, which have not previously been used to build digital language resources in Indonesia. We present DriveThru, a platform that extracts document content using Optical Character Recognition (OCR) so that language resources can be built with less manual effort and cost. We also study the utility of current state-of-the-art large language models (LLMs) for post-OCR correction, and show that they can increase the character accuracy rate (CAR) and word accuracy rate (WAR) compared with off-the-shelf OCR.
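To make the two evaluation metrics concrete, the sketch below computes CAR and WAR for an OCR output against a reference transcription. The abstract does not define the metrics, so this assumes the common convention of one minus the Levenshtein edit distance divided by the reference length, counted over characters for CAR and over whitespace-separated words for WAR. The function names, the clamping at zero, and the sample strings are illustrative assumptions, not part of DriveThru.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref: Sequence, hyp: Sequence) -> float:
    """Assumed definition: 1 - edit_distance / len(ref), clamped at 0."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def car(reference: str, hypothesis: str) -> float:
    """Character accuracy rate, computed over characters."""
    return accuracy_rate(reference, hypothesis)


def war(reference: str, hypothesis: str) -> float:
    """Word accuracy rate, computed over whitespace-separated tokens."""
    return accuracy_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    reference = "selamat pagi dunia"   # hypothetical ground truth
    raw_ocr = "se1amat pagi dunla"     # hypothetical OCR output with two errors
    corrected = "selamat pagi dunia"   # hypothetical post-OCR correction
    print(f"raw OCR:   CAR={car(reference, raw_ocr):.3f} WAR={war(reference, raw_ocr):.3f}")
    print(f"corrected: CAR={car(reference, corrected):.3f} WAR={war(reference, corrected):.3f}")
```

Under this definition a single wrong character costs a whole word in WAR but only one character in CAR, so WAR is typically the lower of the two for the same output.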