BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including Modern Standard Arabic (MSA), Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual, verification-oriented style. Furthermore, a comparative analysis of Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results underscore the role of informal language practices in the diffusion and reception of misinformation. By providing a richly annotated, publicly available dataset, this work contributes to research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.
