The rapid growth of social media has intensified the spread of rumours. The problem is harder in the Algerian context because dialectal content is informal and code-switched, annotated resources are scarce, and standard Arabic NLP tools perform poorly on dialect text. This paper presents an end-to-end hybrid framework for rumour detection in Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, and label it automatically through a similarity-based annotation process. We also introduce a transliteration pipeline that generates parallel datasets in Arabic script and Arabizi. We evaluate classical machine learning, deep learning, transformer, and hybrid models. In our experiments, a hybrid approach that combines transformer embeddings with a classical classifier performs best, reaching an F1 score of 0.84. We also find that domain-specific pre-training matters more than model size: models trained on social media text outperform larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.
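To make the idea of similarity-based automatic labelling concrete, the following is a minimal sketch and not the procedure used in this work. It assumes a small set of already-labelled seed posts, Jaccard overlap on word tokens as the similarity measure, and a fixed threshold below which a post is left unlabelled. The function names (`tokens`, `jaccard`, `auto_label`), the threshold value, and the Arabizi seed strings are all invented for illustration.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in text.

    \\w is Unicode-aware in Python 3, so the same function handles
    Arabic script and Arabizi (Latin letters mixed with digits).
    """
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def auto_label(post, seeds, threshold=0.3):
    """Label a post with the label of its most similar seed.

    seeds is a list of (text, label) pairs. Returns (label, score),
    or (None, score) when the best score is below the threshold
    (an assumed cut-off, chosen here only for illustration).
    """
    post_tokens = tokens(post)
    best_label, best_score = None, 0.0
    for seed_text, seed_label in seeds:
        score = jaccard(post_tokens, tokens(seed_text))
        if score > best_score:
            best_label, best_score = seed_label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Invented seed posts in Arabizi, for illustration only.
seeds = [
    ("rahom y9olo beli el ma rah yen9ata3 ghodwa", "rumour"),
    ("la meteo ta3 lyoum: chta f el 3asima", "non-rumour"),
]

label, score = auto_label("y9olo beli el ma rah yen9ata3", seeds)
print(label, round(score, 2))
```

A post that shares too few tokens with every seed comes back as `None`, which leaves it for manual review instead of forcing a label. A real pipeline would more plausibly compare sentence embeddings than raw tokens, since dialectal spelling varies widely.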