In natural language processing (NLP), researchers aim to enable computers to identify and understand patterns in human languages. This is often difficult because a language embeds many dynamic and varied properties in its syntax, pragmatics and phonology, all of which must be captured and processed. The capacity of computers to process natural languages continues to grow as NLP researchers push the boundaries of the field. However, most of this research focuses on well-resourced languages such as English, Japanese, German, French, Russian and Mandarin Chinese. Over 95% of the world's roughly 7,000 languages are low-resourced for NLP, i.e. they have little or no data, tools, or techniques available for NLP work. In this thesis, we present an overview of diacritic ambiguity and review previous diacritic disambiguation approaches applied to other languages. Focusing on the Igbo language, we report the steps taken to develop a flexible framework for generating datasets for diacritic restoration. Three main approaches are proposed: the standard n-gram models, the classification models and the embedding models. The standard n-gram models use the sequence of words preceding the target stripped word as the key predictor of the correct variant. The classification models use a window of words on both sides of the target stripped word as features. The embedding models compare the similarity scores between the combined embeddings of the context words and the embedding of each candidate variant.
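To make the embedding-based approach concrete, the following is a minimal sketch under stated assumptions, not the thesis implementation: it assumes a hypothetical pre-trained embedding lookup (`embeddings`), combines the vectors of the words in a window around the stripped target word by summation, and ranks the candidate diacritic variants by cosine similarity. All function names, window sizes and toy vectors below are illustrative only.

```python
import numpy as np

def restore_diacritics(tokens, index, candidates, embeddings, window=2):
    """Pick the diacritic variant whose embedding is most similar to the
    combined embeddings of the words around the stripped target word.

    tokens     -- list of (stripped) words in the sentence
    index      -- position of the target stripped word
    candidates -- possible diacritic variants of tokens[index]
    embeddings -- dict mapping a word to its vector (hypothetical lookup)
    window     -- number of context words taken on each side
    """
    # Combine the context word embeddings by summing them.
    context = tokens[max(0, index - window):index] + tokens[index + 1:index + 1 + window]
    vectors = [embeddings[w] for w in context if w in embeddings]
    if not vectors:
        return candidates[0]  # fall back to the first candidate if no context is known
    context_vec = np.sum(vectors, axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Score each candidate variant against the combined context vector.
    scores = {c: cosine(context_vec, embeddings[c]) for c in candidates if c in embeddings}
    return max(scores, key=scores.get) if scores else candidates[0]

# Toy usage with made-up 3-dimensional vectors (illustration only).
toy = {
    "o": np.array([0.1, 0.2, 0.3]),
    "na": np.array([0.2, 0.1, 0.4]),
    "ákwà": np.array([0.6, 0.2, 0.1]),   # candidate variant 1
    "àkwá": np.array([0.1, 0.6, 0.5]),   # candidate variant 2
}
sentence = ["o", "na", "akwa"]
print(restore_diacritics(sentence, 2, ["ákwà", "àkwá"], toy))
```

In this sketch the context vectors are simply summed; averaging or weighting by distance from the target would be equally valid design choices for combining the context embeddings.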