Nigeria is a multilingual country with 500+ languages. Naija is a Nigerian Pidgin spoken by approximately 120M speakers in Nigeria, and it is a mixed language drawing on, e.g., English, Portuguese, Yoruba, Hausa and Igbo. Although it has mainly been a spoken language until recently, there are now various platforms publishing exclusively in Naija, such as Naija Wikipedia. However, non-native speakers find it hard to distinguish Naija from the broader group of pidgin languages spoken across West Africa, known as West African Pidgin English (WAPE), which is more simplified and understandable by a wider audience in Ghana, Nigeria, and Cameroon. The BBC news platform publishes exclusively in WAPE to cater for several countries in West Africa. In our paper, we show through statistical analyses and Machine Translation experiments that these two creole varieties are not representative of each other (i.e., there are linguistic differences in word order and vocabulary) and that Generative AI operates only on the basis of WAPE. In other words, Naija is under-represented in Generative AI, and it is hard to teach to LLMs with only a few examples.