Misinformation is a complex societal issue, and solutions to mitigate it are difficult to develop because of data deficiencies. To address this problem, we curate the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluate the quality of all 36 datasets that consist of statements or claims. We assess these datasets to identify those with solid foundations for empirical work and those with flaws that could produce misleading and non-generalizable results, such as insufficient label quality, spurious correlations, or political bias. We further provide state-of-the-art baselines on all these datasets, but show that, regardless of label quality, categorical labels may no longer give an accurate evaluation of detection model performance. We discuss alternatives to mitigate this problem. Overall, this guide aims to provide a roadmap for obtaining higher-quality data and conducting more effective evaluations, ultimately improving research in misinformation detection. All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.