How good are Large Language Models on African Languages?

Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to perform well, using in-context learning, even on tasks and languages they were not trained on. However, their performance on African languages is largely understudied relative to high-resource languages. We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks (topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition) across 60 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs perform worse on African languages, with a large gap relative to high-resource languages (such as English) on most tasks. We find that GPT-4 achieves average to good performance on classification tasks, yet its performance on generative tasks such as machine translation and summarization lags significantly. Surprisingly, mT0 has the best overall performance on cross-lingual QA for African languages, outperforming both the state-of-the-art supervised model (i.e., fine-tuned mT5) and GPT-4. Similarly, the recent Aya model achieves results comparable to mT0 on almost all tasks, except for topic classification, where it outperforms mT0. Overall, LLaMa 2 shows the worst performance, which we believe is due to its English- and code-centric pre-training corpus (around 98%). Our findings confirm that performance on African languages remains a hurdle for current LLMs, underscoring the need for additional efforts to close this gap.
