In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the $\pi$-language type, i.e. a language with few digital resources, for which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences, in order to enlarge the corpora available for language model training. We want to show that such a grammar makes it possible to significantly expand a corpus in Nawatl which we call $\pi$-\textsc{yalli}. The corpus thus enriched enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that, by using the grammar, improvements are obtained over some LLMs. However, it appears that to achieve more substantial gains, grammars that model the Nawatl language even more effectively are required.
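The generation strategy described above can be sketched as recursive expansion of a CFG. The rules and lexical items below are illustrative placeholders chosen for this sketch, not the paper's actual Nawatl grammar:

```python
import random

# Toy CFG in the spirit of the described approach: expand nonterminals
# recursively until only terminals remain, yielding an artificial sentence.
# These rules and tokens are hypothetical stand-ins, NOT real Nawatl grammar.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["in"]],                    # placeholder determiner
    "N":   [["kalli"], ["siwatl"]],     # placeholder nouns
    "V":   [["kita"]],                  # placeholder verb
}

def generate(symbol="S", rng=random):
    """Recursively expand `symbol` into a list of terminal tokens."""
    if symbol not in GRAMMAR:           # terminal symbol: emit as-is
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    tokens = []
    for sym in production:
        tokens.extend(generate(sym, rng))
    return tokens

# Sampling repeatedly produces distinct grammatically well-formed strings,
# which is how a CFG can enlarge a small training corpus.
sentence = " ".join(generate())
print(sentence)
```

Each sampled derivation is grammatical by construction, so repeated sampling yields an arbitrarily large set of well-formed (if semantically shallow) sentences for corpus augmentation.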