DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives

Indonesia is one of the most linguistically diverse countries in the world. Despite this diversity, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been made to construct NLP resources for Indonesian languages. However, most of these efforts have focused on creating resources manually and are therefore difficult to scale to more languages. Although many Indonesian languages do not have a web presence, there are local resources that document these languages well in printed form, such as books, magazines, and newspapers. Digitizing these existing resources will enable Indonesian language resource construction to scale to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing documents, which have not previously been used to build digital language resources in Indonesia. DriveThru is a platform that extracts document content using Optical Character Recognition (OCR), so that language resources can be built with less manual effort and cost. This paper also studies the utility of current state-of-the-art LLMs for post-OCR correction and shows that they can increase the character accuracy rate (CAR) and word accuracy rate (WAR) compared to off-the-shelf OCR.
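
The abstract names CAR and WAR but does not define them. The sketch below shows one common way such metrics are computed, assuming each is defined as one minus the Levenshtein edit distance divided by the reference length, taken over characters for CAR and over whitespace-separated words for WAR. That definition, the function names, and the sample strings are illustrative assumptions, not the paper's implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """Assumed definition: 1 - distance / len(ref), clamped at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def character_accuracy_rate(reference, hypothesis):
    # Compare the two strings character by character.
    return accuracy_rate(reference, hypothesis)


def word_accuracy_rate(reference, hypothesis):
    # Compare whitespace-separated tokens instead of characters.
    return accuracy_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    reference = "sugeng enjing"   # hypothetical ground-truth line
    ocr_output = "sugeng enjlng"  # hypothetical OCR output with one wrong character
    print(character_accuracy_rate(reference, ocr_output))
    print(word_accuracy_rate(reference, ocr_output))
```

Under this assumed definition, a single misread character lowers CAR only slightly but marks the whole word as wrong for WAR, which is why the two metrics are usually reported together when comparing raw OCR output with post-corrected text.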

