The proliferation of radical content on online platforms poses significant risks, including inciting violence and spreading extremist ideologies. Despite ongoing research, existing datasets and models often fail to address the complexities of multilingual and diverse data. To bridge this gap, we introduce a publicly available multilingual dataset annotated with radicalization levels, calls for action, and named entities in English, French, and Arabic. This dataset is pseudonymized to protect individual privacy while preserving contextual information. Beyond presenting our freely available dataset, we analyze the annotation process, highlighting biases and disagreements among annotators and their implications for model performance. Additionally, we use synthetic data to investigate the influence of socio-demographic traits on annotation patterns and model predictions. Our work offers a comprehensive examination of the challenges and opportunities in building robust datasets for radical content detection, emphasizing the importance of fairness and transparency in model development.