A persistent challenge in natural language processing is that its major advances disproportionately favor resource-rich languages, leaving many under-resourced languages behind. Because the resources required to train and evaluate models are lacking, most modern language technologies are either nonexistent or unreliable for endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered-language documents into machine-readable data. However, OCR output is typically noisy, and most word alignment models are not built to work under such conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.
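For readers unfamiliar with the metric, the following is a minimal sketch of how alignment error rate (AER) is commonly computed, assuming the standard definition over "sure" and "possible" gold links. The function names, the toy alignments, and the `relative_reduction` helper are illustrative assumptions, not taken from this work, and the abstract does not state whether its reported figure is a relative or an absolute reduction.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Standard AER: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted: iterable of (source_index, target_index) links from the aligner (A)
    sure:      gold links annotated as certain (S)
    possible:  gold links annotated as acceptable (P); S is always included in P
    """
    a = set(predicted)
    s = set(sure)
    p = s | set(possible or ())
    if not a and not s:
        return 0.0  # nothing predicted and nothing expected: no error
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


def relative_reduction(baseline_aer, improved_aer):
    """Fraction of the baseline error that was removed (hypothetical helper)."""
    return (baseline_aer - improved_aer) / baseline_aer


# Toy example with made-up links, for illustration only.
sure_links = {(0, 0), (1, 1), (2, 2)}
possible_links = {(2, 3)}
predicted_links = {(0, 0), (1, 1), (2, 3), (3, 3)}

aer = alignment_error_rate(predicted_links, sure_links, possible_links)
print(f"AER = {aer:.4f}")
```

Lower AER is better. A predicted link that matches only a "possible" gold link is not counted against precision, but it does not help recall on the "sure" links.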