We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritic restoration in under-resourced languages, including restoration of tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely because they can decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively under our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.
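As a minimal illustration of the UTF-8 decomposition mentioned above (a sketch for intuition, not the paper's code; the reference to byte-level models such as ByT5 and the choice of example characters are our assumptions), the following Python snippet shows how a plain vowel and vowels carrying macron, tone, and nasality diacritics expand into different byte sequences, which is the representation a byte-level model consumes:

    # Illustrative sketch only: byte-level models operate on UTF-8 bytes,
    # so a vowel carrying a diacritic expands into a multi-byte sequence
    # that a model can learn to predict from the plain base character.
    import unicodedata

    examples = {
        "plain vowel": "a",
        "macron (Cook Islands Maori)": "ā",
        "grave tone (Bribri)": "à",
        "line below for nasality (Bribri)": "a\u0331",  # U+0331 is one possible encoding
    }
    for label, ch in examples.items():
        nfc = unicodedata.normalize("NFC", ch)  # normalize to composed form where one exists
        print(f"{label}: codepoints={[hex(ord(c)) for c in nfc]}, "
              f"utf8_bytes={list(nfc.encode('utf-8'))}")

Under this view, diacritic restoration becomes a byte-sequence transduction task: mapping the byte sequence of the stripped text (e.g., [97] for "a") to the byte sequence of the diacritized target (e.g., [196, 129] for "ā").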