HQP: A Human-Annotated Dataset for Detecting Online Propaganda - 专知论文

会员服务 ·

0

在线 · AUC · 标注 · state-of-the-art · 数据集 ·

2023 年 5 月 1 日

HQP: A Human-Annotated Dataset for Detecting Online Propaganda

翻译：HQP：面向在线宣传检测的高质量人工标注数据集

Abdurahman Maarouf,Dominik Bär,Dominique Geissler,Stefan Feuerriegel

Online propaganda poses a severe threat to the integrity of societies. However, existing datasets for detecting online propaganda have a key limitation: they were annotated using weak labels that can be noisy and even incorrect. To address this limitation, our work makes the following contributions: (1) We present HQP: a novel dataset (N=30,000) for detecting online propaganda with high-quality labels. To the best of our knowledge, HQP is the first dataset for detecting online propaganda that was created through human annotation. (2) We show empirically that state-of-the-art language models fail in detecting online propaganda when trained with weak labels (AUC: 64.03). In contrast, state-of-the-art language models can accurately detect online propaganda when trained with our high-quality labels (AUC: 92.25), which is an improvement of ~44%. (3) To address the cost of labeling, we extend our work to few-shot learning. Specifically, we show that prompt-based learning using a small sample of high-quality labels can still achieve a reasonable performance (AUC: 80.27). Finally, we discuss implications for the NLP community to balance the cost and quality of labeling. Crucially, our work highlights the importance of high-quality labels for sensitive NLP tasks such as propaganda detection.

翻译：在线宣传对社会的完整性构成了严重威胁。然而，现有用于检测在线宣传的数据集存在关键局限：其标注采用弱标签方式，这些标签可能存在噪声甚至错误。为解决此问题，我们的工作做出以下贡献：(1) 提出HQP——一个包含3万条高质量标签的新数据集，用于检测在线宣传。据我们所知，HQP是首个通过人工标注创建的在线宣传检测数据集。(2) 通过实验证明，当使用弱标签训练时，最先进语言模型在检测在线宣传时效果不佳（AUC: 64.03）。相比之下，使用我们的高质量标签训练时，最先进语言模型能够准确检测在线宣传（AUC: 92.25），性能提升约44%。(3) 为解决标注成本问题，我们将工作扩展到少样本学习场景。具体而言，我们证明基于提示的学习方法仅需少量高质量标签样本即可获得合理性能（AUC: 80.27）。最后，我们探讨了NLP领域在平衡标注成本与质量方面的启示。关键的是，我们的工作凸显了对于宣传检测等敏感NLP任务而言，高质量标签至关重要。

0

相关内容

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

专知会员服务

60+阅读 · 2022年4月22日

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知会员服务

105+阅读 · 2022年2月10日

零样本文本分类，Zero-Shot Learning for Text Classification

零样本文本分类，Zero-Shot Learning for Text Classification

专知会员服务

97+阅读 · 2020年5月31日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

专知会员服务

15+阅读 · 2019年10月23日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

80+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

图与推荐

2+阅读 · 2022年11月2日

灾难性遗忘问题新视角：迁移-干扰平衡

灾难性遗忘问题新视角：迁移-干扰平衡

CreateAMind

17+阅读 · 2019年7月6日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

最新5篇生成对抗网络相关论文推荐—FusedGAN、DeblurGAN、AdvGAN、CipherGAN、MMD GANS

最新5篇生成对抗网络相关论文推荐—FusedGAN、DeblurGAN、AdvGAN、CipherGAN、MMD GANS

专知

23+阅读 · 2018年1月18日

MoCoGAN 分解运动和内容的视频生成

MoCoGAN 分解运动和内容的视频生成

CreateAMind

18+阅读 · 2017年10月21日

基于聚合的社会化短文本信息处理与细粒度倾向性分析

国家自然科学基金

0+阅读 · 2015年12月31日

长链非编码RNA-TRA调控乳腺癌内分泌耐药的机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

斜硅石(moganite)高温晶体结构和相变的固体光谱学研究

国家自然科学基金

0+阅读 · 2012年12月31日

有限理性下的自媒体证券信息传播：资源价值与负面效应

国家自然科学基金

0+阅读 · 2012年12月31日

车载激光扫描点云与全景影像的高精度配准方法

国家自然科学基金

0+阅读 · 2012年12月31日

钒氧化物-石墨烯纳米复合材料的制备及催化苯羟基化性能

国家自然科学基金

0+阅读 · 2012年12月31日

DEC1、DEC2对人乳腺癌细胞衰老的调控作用及其作用机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

疏肝益肾方抗乳腺癌内分泌治疗耐药的增效机制

国家自然科学基金

0+阅读 · 2011年12月31日

基于MEMS二维扫描镜的激光主动4D成像制导技术研究

国家自然科学基金

0+阅读 · 2009年12月31日

Ca3Co2O6基自旋失措体系的磁台阶与磁介电效应

国家自然科学基金

0+阅读 · 2008年12月31日

Generated Graph Detection

Arxiv

0+阅读 · 2023年6月13日

Automatic Change-Point Detection in Time Series via Deep Learning

Arxiv

0+阅读 · 2023年6月9日

Reevaluating Loss Functions: Enhancing Robustness to Label Noise in Deep Learning Models

Arxiv

0+阅读 · 2023年6月8日

A Systematic Survey on Deep Generative Models for Graph Generation

Arxiv

18+阅读 · 2022年10月4日

Learning with Limited Annotations: A Survey on Deep Semi-Supervised Learning for Medical Image Segmentation

Learning with Limited Annotations: A Survey on Deep Semi-Supervised Learning for Medical Image Segmentation

Arxiv

13+阅读 · 2022年7月28日

Privacy and Robustness in Federated Learning: Attacks and Defenses

Arxiv

35+阅读 · 2020年12月7日

Co-mining: Self-Supervised Learning for Sparsely Annotated Object Detection

Arxiv

13+阅读 · 2020年12月3日

Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach

Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach

Arxiv

14+阅读 · 2019年12月27日

f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning

Arxiv

11+阅读 · 2019年3月25日

DOTA: A Large-scale Dataset for Object Detection in Aerial Images

Arxiv

19+阅读 · 2018年1月27日

VIP会员

文章信息

相关主题

state-of-the-art

最新内容

无人机自主控制与人工智能：系统性综述

无人机自主控制与人工智能：系统性综述

专知会员服务

10+阅读 · 今天7:25

巡飞弹与反无人机系统——现代战场的两大支柱

巡飞弹与反无人机系统——现代战场的两大支柱

专知会员服务

3+阅读 · 今天6:54

《打造“黄金舰队”》57页报告

《打造“黄金舰队”》57页报告

专知会员服务

3+阅读 · 今天6:52

《北约数字教官网络发展路径》128页报告

《北约数字教官网络发展路径》128页报告

专知会员服务

2+阅读 · 今天6:33

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

专知会员服务

7+阅读 · 6月25日

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

专知会员服务

6+阅读 · 6月25日

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

专知会员服务

9+阅读 · 6月25日

网状网络及其在军事领域的运用

网状网络及其在军事领域的运用

专知会员服务

7+阅读 · 6月25日

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

专知会员服务

8+阅读 · 6月25日

无美国参与的欧洲战争方式（万字长文）

无美国参与的欧洲战争方式（万字长文）

专知会员服务

8+阅读 · 6月25日

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

专知会员服务

10+阅读 · 6月25日

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

专知会员服务

9+阅读 · 6月25日

《国防领域敏感性分析白皮书》

《国防领域敏感性分析白皮书》

专知会员服务

9+阅读 · 6月25日

综述 | 从问答到任务完成：Agent系统与Harness设计

综述 | 从问答到任务完成：Agent系统与Harness设计

专知会员服务

10+阅读 · 6月24日

Agentic RL：框架、实践与长程智能体训练

Agentic RL：框架、实践与长程智能体训练

专知会员服务

10+阅读 · 6月24日

相关VIP内容

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

专知会员服务

60+阅读 · 2022年4月22日

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知会员服务

105+阅读 · 2022年2月10日

零样本文本分类，Zero-Shot Learning for Text Classification

零样本文本分类，Zero-Shot Learning for Text Classification

专知会员服务

97+阅读 · 2020年5月31日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

专知会员服务

15+阅读 · 2019年10月23日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

80+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

巡飞弹与反无人机系统——现代战场的两大支柱

《北约数字教官网络发展路径》128页报告

无人机自主控制与人工智能：系统性综述

《打造“黄金舰队”》57页报告

相关资讯

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

图与推荐

2+阅读 · 2022年11月2日

灾难性遗忘问题新视角：迁移-干扰平衡

灾难性遗忘问题新视角：迁移-干扰平衡

CreateAMind

17+阅读 · 2019年7月6日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

最新5篇生成对抗网络相关论文推荐—FusedGAN、DeblurGAN、AdvGAN、CipherGAN、MMD GANS

最新5篇生成对抗网络相关论文推荐—FusedGAN、DeblurGAN、AdvGAN、CipherGAN、MMD GANS

专知

23+阅读 · 2018年1月18日

MoCoGAN 分解运动和内容的视频生成

MoCoGAN 分解运动和内容的视频生成

CreateAMind

18+阅读 · 2017年10月21日

相关论文

Generated Graph Detection

Arxiv

0+阅读 · 2023年6月13日

Automatic Change-Point Detection in Time Series via Deep Learning

Arxiv

0+阅读 · 2023年6月9日

Reevaluating Loss Functions: Enhancing Robustness to Label Noise in Deep Learning Models

Arxiv

0+阅读 · 2023年6月8日

A Systematic Survey on Deep Generative Models for Graph Generation

Arxiv

18+阅读 · 2022年10月4日

Learning with Limited Annotations: A Survey on Deep Semi-Supervised Learning for Medical Image Segmentation

Learning with Limited Annotations: A Survey on Deep Semi-Supervised Learning for Medical Image Segmentation

Arxiv

13+阅读 · 2022年7月28日

Privacy and Robustness in Federated Learning: Attacks and Defenses

Arxiv

35+阅读 · 2020年12月7日

Co-mining: Self-Supervised Learning for Sparsely Annotated Object Detection

Arxiv

13+阅读 · 2020年12月3日

Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach

Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach

Arxiv

14+阅读 · 2019年12月27日

f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning

Arxiv

11+阅读 · 2019年3月25日

DOTA: A Large-scale Dataset for Object Detection in Aerial Images

Arxiv

19+阅读 · 2018年1月27日

相关基金

基于聚合的社会化短文本信息处理与细粒度倾向性分析

国家自然科学基金

0+阅读 · 2015年12月31日

长链非编码RNA-TRA调控乳腺癌内分泌耐药的机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

斜硅石(moganite)高温晶体结构和相变的固体光谱学研究

国家自然科学基金

0+阅读 · 2012年12月31日

有限理性下的自媒体证券信息传播：资源价值与负面效应

国家自然科学基金

0+阅读 · 2012年12月31日

车载激光扫描点云与全景影像的高精度配准方法

国家自然科学基金

0+阅读 · 2012年12月31日

钒氧化物-石墨烯纳米复合材料的制备及催化苯羟基化性能

国家自然科学基金

0+阅读 · 2012年12月31日

DEC1、DEC2对人乳腺癌细胞衰老的调控作用及其作用机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

疏肝益肾方抗乳腺癌内分泌治疗耐药的增效机制

国家自然科学基金

0+阅读 · 2011年12月31日

基于MEMS二维扫描镜的激光主动4D成像制导技术研究

国家自然科学基金

0+阅读 · 2009年12月31日

Ca3Co2O6基自旋失措体系的磁台阶与磁介电效应

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员