This paper investigates the language of propaganda and its stylistic features. It presents the PPN dataset (Propagandist Pseudo-News), a multisource, multilingual, multimodal dataset composed of news articles extracted from websites identified as propaganda sources by expert agencies. A limited sample from this dataset was randomly mixed with articles from the regular French press, with their URLs masked, to conduct a human annotation experiment using 11 distinct labels. The results show that human annotators were able to reliably discriminate between the two types of press across each of the labels. We propose different NLP techniques to identify the cues used by the annotators and to compare them with machine classification. These include the VAGO analyzer, which measures discourse vagueness and subjectivity; a TF-IDF model serving as a baseline; and four different classifiers: two RoBERTa-based models, CATS, which uses syntax, and one XGBoost model combining syntactic and semantic features.

Keywords: Propaganda, Fake News, Explainability, AI alignment, Vagueness, Subjectivity, Exaggeration, Stylistic analysis