Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety

The rise of digital technology has sharply increased the potential for cyberbullying and online abuse, making stronger detection and prevention measures necessary, especially for children. This study focuses on detecting obfuscated abusive language in Swahili, a low-resource language that poses particular challenges because of its limited linguistic resources and technological support. We chose Swahili because it is the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, across East Africa and parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning, and used techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well on high-dimensional textual data, the small size and imbalance of our dataset limit the generalizability of our findings. We analyzed precision, recall, and F1 scores to compare how each model performs in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, and it advocates for expanded datasets and more advanced machine learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms.
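To illustrate why precision, recall, and F1 are reported rather than accuracy alone on an imbalanced dataset, the following minimal sketch computes the three metrics for the abusive (positive) class. The function name `precision_recall_f1` and the toy labels are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, made-up labels: 1 = abusive, 0 = not abusive (imbalanced, 3 of 10 positive).
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the accuracy is higher than the F1 score for the abusive class, because correctly labeled non-abusive examples dominate the accuracy figure. This is the usual reason to report per-class metrics when the classes are imbalanced.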
