A Guide to Misinformation Detection Datasets

Misinformation is a complex societal issue, and mitigating solutions are difficult to create due to data deficiencies. To address this problem, we have curated the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluated the quality of all of the 36 datasets that consist of statements or claims. We assess these datasets to identify those with solid foundations for empirical work and those with flaws that could result in misleading and non-generalizable results, such as insufficient label quality, spurious correlations, or political bias. We further provide state-of-the-art baselines on all these datasets, but show that regardless of label quality, categorical labels may no longer give an accurate evaluation of detection model performance. We discuss alternatives to mitigate this problem. Overall, this guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations, ultimately improving research in misinformation detection. All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.

