Detection of Offensive and Threatening Online Content in a Low Resource Language

Hausa is a major Chadic language, spoken by over 100 million people in Africa. From a computational linguistic perspective, however, it is considered a low-resource language, with limited resources to support Natural Language Processing (NLP) tasks. Online platforms facilitate social interactions that can involve offensive and threatening language, which can go undetected due to the lack of detection systems designed for Hausa. This study aimed to address this issue by (1) conducting two user studies (n=308) to investigate cyberbullying-related issues, (2) collecting and annotating the first datasets of offensive and threatening content to support relevant downstream tasks in Hausa, (3) developing a detection system to flag offensive and threatening content, and (4) evaluating the detection system and the efficacy of the Google-based translation engine in detecting offensive and threatening terms in Hausa. We found that offensive and threatening content is quite common, particularly in discussions of religion and politics. Our detection system detected more than 70% of offensive and threatening content, although many of these items were mistranslated by Google's translation engine. We attribute this to the subtle relationship between offensive and threatening content and idiomatic expressions in the Hausa language. We recommend that diverse stakeholders participate in understanding local conventions and demographics in order to develop a more effective detection system. These insights are essential for implementing targeted moderation strategies to create a safe and inclusive online environment.
