Naija, or Nigerian Pidgin, is spoken by approximately 120 million people in Nigeria. It is a mixed language that draws on English, Portuguese, and Indigenous languages. Although Naija was mainly a spoken language until recently, it now has two written genres: BBC and Wikipedia. Through statistical analyses and Machine Translation experiments, we show that these two genres do not represent each other (i.e., they differ linguistically in word order and vocabulary) and that Generative AI operates only on Naija written in the BBC genre. In other words, Naija written in the Wikipedia genre is not represented in Generative AI.