Online propaganda poses a severe threat to the integrity of societies. However, existing datasets for detecting online propaganda have a key limitation: they were annotated using weak labels that can be noisy and even incorrect. To address this limitation, our work makes the following contributions: (1) We present \dataset: a novel dataset (N=30,000) for detecting online propaganda with high-quality labels. To the best of our knowledge, \dataset is the first dataset for detecting online propaganda that was created through human annotation. (2) We show empirically that state-of-the-art language models fail to detect online propaganda when trained with weak labels (AUC: 64.03). In contrast, state-of-the-art language models can accurately detect online propaganda when trained with our high-quality labels (AUC: 92.25), which is a relative improvement of ~44%. (3) To address the cost of labeling, we extend our work to few-shot learning. Specifically, we show that prompt-based learning using a small sample of high-quality labels can still achieve reasonable performance (AUC: 80.27). Finally, we discuss implications for how the NLP community can balance the cost and quality of labeling. Crucially, our work highlights the importance of high-quality labels for sensitive NLP tasks such as propaganda detection.
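As a quick check on the reported gain, and assuming the ~44% figure is measured relative to the weak-label baseline rather than as a difference in AUC points, the two AUC values stated above give:

\[
\frac{92.25 - 64.03}{64.03} = \frac{28.22}{64.03} \approx 0.441
\]

That is, roughly 44% in relative terms, which corresponds to 28.22 AUC points in absolute terms.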