An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect

The rapid growth of social media has intensified the spread of rumours. The problem is harder in the Algerian context because dialectal content is informal and code-switched, annotated resources are scarce, and standard Arabic NLP tools perform poorly on dialect text. This paper presents an end-to-end hybrid framework for rumour detection in Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, and label it automatically with a similarity-based annotation process. We also introduce a transliteration pipeline that generates parallel datasets in Arabic script and Arabizi. We evaluate classical machine learning, deep learning, transformer, and hybrid models. A hybrid approach that combines transformer embeddings with a classical classifier performs best, reaching an F1-score of 0.84. We also find that domain-specific pre-training matters more than model size: models pre-trained on social media text outperform larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.
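The abstract does not say which similarity measure drives the automatic labelling, so the following is only a minimal sketch of the general idea, assuming cosine similarity over character trigram counts. The function names, the toy seed posts in `SEEDS`, and the `threshold` value are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def auto_label(post, seeds, threshold=0.3):
    """Return (label, score) of the most similar seed post.

    The label is None when the best score is below the threshold,
    so uncertain posts can be routed to manual annotation instead.
    """
    vec = char_ngrams(post)
    best_label, best_score = None, 0.0
    for seed_text, label in seeds:
        score = cosine(vec, char_ngrams(seed_text))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Hypothetical seed posts in English, used purely as placeholders.
SEEDS = [
    ("breaking: the bridge has collapsed share now", "rumour"),
    ("the ministry published the official exam schedule", "non-rumour"),
]

if __name__ == "__main__":
    print(auto_label("BREAKING the bridge collapsed, share now!!", SEEDS))
    print(auto_label("xyz qqq", SEEDS))
```

Character n-grams are chosen here because they tolerate the spelling variation common in informal dialect text. A real pipeline could replace `char_ngrams` with sentence embeddings while keeping the same thresholded nearest-seed logic.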

