WithdrarXiv: A Large-Scale Dataset for Retraction Study

Retractions play a vital role in maintaining scientific integrity, yet systematic studies of retractions in computer science and other STEM fields remain scarce. We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv, containing over 14,000 papers and their associated retraction comments spanning the repository's entire history through September 2024. Through careful analysis of author comments, we develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations. We demonstrate a simple yet highly accurate zero-shot automatic categorization of retraction reasons, achieving a weighted average F1-score of 0.96. Additionally, we release WithdrarXiv-SciFy, an enriched version including scripts for parsed full-text PDFs, specifically designed to enable research in scientific feasibility studies, claim verification, and automated theorem proving. These findings provide valuable insights for improving scientific quality control and automated verification systems. Finally, and most importantly, we discuss ethical issues and take a number of steps to implement responsible data release while fostering open science in this area.
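For readers unfamiliar with the reported metric, here is a minimal sketch of how a weighted-average F1 score is usually computed: each class's F1 is weighted by that class's share of the true labels. The function name `weighted_f1` and the labels `"error"` and `"policy"` are illustrative assumptions; they are not the paper's evaluation code or its actual category names.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by class support in y_true."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


# Hypothetical toy labels, not taken from the dataset.
y_true = ["error", "error", "error", "policy"]
y_pred = ["error", "error", "policy", "policy"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.767
```

Because of the support weighting, frequent categories dominate the score, so a high weighted F1 does not by itself show that rare categories are classified well.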

