This study examines the digital representation of African languages and the challenges it presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and not representative of the authentic, monolingual conversations prevalent among native speakers. This scarcity of authentic online data poses a challenge for training conversational language models. To investigate this, data was collected from subreddits and local news sources for each language. The analysis showed a stark contrast between the two sources. Reddit data was minimal and characterized by heavy code-switching. Conversely, local news media offered a robust source of clean, monolingual language data, which also prompted more user engagement in the local language on the news publishers' social media pages. Language detection models, including the specialized AfroLID and a general LLM, performed with near-perfect accuracy on the clean news data but struggled with the code-switched Reddit posts. The study concludes that professionally curated news content is a more reliable and effective source for training context-rich AI models for African languages than data from conversational platforms. It also highlights the need for future models that can process both clean and code-switched text to improve detection accuracy for African languages.
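To make the code-switching problem concrete, the sketch below scores how mixed a post is by counting tokens against two tiny illustrative wordlists. The wordlists, function name, and threshold-free scoring are all hypothetical simplifications for exposition; a real pipeline would use a trained identifier such as AfroLID rather than lexicon lookup.

```python
# Toy sketch of measuring code-switching in a short post.
# ENGLISH and YORUBA are tiny hypothetical wordlists for illustration only;
# a production system would rely on a trained language identifier.
ENGLISH = {"the", "and", "is", "my", "very", "good", "please", "thanks"}
YORUBA = {"bawo", "ni", "ore", "mi", "ejo", "ese", "daadaa", "o"}

def mixing_ratio(text: str) -> float:
    """Fraction of recognized tokens belonging to the minority language.

    Returns 0.0 for monolingual text and approaches 0.5 for an even mix.
    """
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    eng = sum(t in ENGLISH for t in tokens)
    yor = sum(t in YORUBA for t in tokens)
    total = eng + yor
    if total == 0:
        return 0.0  # no recognized tokens: nothing to classify
    return min(eng, yor) / total

# A monolingual Yoruba sentence scores 0.0; mixed Reddit-style text scores higher.
print(mixing_ratio("bawo ni ore mi"))                   # 0.0
print(mixing_ratio("bawo ni, the food is very good"))   # > 0.0 (code-switched)
```

Even this crude heuristic illustrates why clean news text is easy for detectors (one language dominates completely) while conversational posts, where the ratio rises toward an even mix, are much harder.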