Nigeria is a multilingual country with more than 500 languages. Naija, a Nigerian Pidgin with approximately 120 million speakers, is a mixed language that draws on English, Portuguese, Yoruba, Hausa, and Igbo, among others. Although it was mainly a spoken language until recently, some online platforms (e.g., Wikipedia) now publish in written Naija. West African Pidgin English (WAPE) is also spoken in Nigeria, and the BBC uses it to broadcast news on the internet to a wider audience, not only in Nigeria but also in other West African countries (e.g., Cameroon and Ghana). Through statistical analyses and machine translation experiments, we show that these two pidgin varieties do not represent each other (i.e., they differ in word order and vocabulary) and that Generative AI operates only on the basis of WAPE. In other words, Naija is underrepresented in Generative AI, and it is hard to teach LLMs the language with only a few examples. In addition to the statistical analyses, we provide historical information on both pidgins, as well as insights from interviews conducted with volunteer contributors to Wikipedia in Naija.