Retractions play a vital role in maintaining scientific integrity, yet systematic studies of retractions in computer science and other STEM fields remain scarce. We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv, containing over 14,000 papers and their associated retraction comments spanning the repository's entire history through September 2024. Through careful analysis of author comments, we develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations. We demonstrate a simple yet highly accurate zero-shot method for automatically categorizing retraction reasons, achieving a weighted average F1-score of 0.96. Additionally, we release WithdrarXiv-SciFy, an enriched version that includes scripts for parsing full-text PDFs and is specifically designed to enable research in scientific feasibility studies, claim verification, and automated theorem proving. These findings provide valuable insights for improving scientific quality control and automated verification systems. Finally, and most importantly, we discuss ethical issues and take a number of steps to implement responsible data release while fostering open science in this area.
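For readers unfamiliar with the reported metric, the weighted average F1-score averages the per-class F1 scores, weighting each class by its number of true instances. The following is a minimal sketch of that computation in plain Python, not the authors' evaluation code. The function name `weighted_f1` and the category labels in the example are hypothetical; the abstract does not list the 10 categories by name.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)  # number of true instances per class
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical category labels, for illustration only.
y_true = ["critical_error", "critical_error", "policy_violation", "superseded"]
y_pred = ["critical_error", "policy_violation", "policy_violation", "superseded"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because each class contributes in proportion to its support, frequent retraction categories dominate the weighted score, so it can stay high even if a rare category is classified poorly.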