PreprintToPaper dataset: connecting bioRxiv preprints with journal publications

The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It comprises metadata for 145,517 preprints from two periods, 2016-2018 (pre-pandemic) and 2020-2022 (pandemic), retrieved via the bioRxiv and Crossref APIs. We selected these two periods to capture preprint-publication dynamics before and during the COVID-19 pandemic while avoiding transitional years. Each record includes bibliographic information such as title, abstract, authors, institutions, submission date, license, and subject category, alongside enriched publication metadata including journal name, publication date, and author list, among other fields. In addition to the main dataset, a version-history subset provides all available versions of the preprints within the two selected periods, enabling analysis of how preprints evolve over time. Preprints are categorized into three groups: Published (formally linked to a journal article), Preprint Only (posted only on a preprint server), and Gray Zone (potentially published in a journal but unlinked). To enhance reliability, we computed title and author similarity scores and created a human-annotated subset of 299 records to evaluate Gray Zone cases. The dataset supports diverse applications, including studies of scholarly communication, open science policies, bibliometric tool development, and natural language processing research on textual changes between preprints and their corresponding journal articles. The dataset is publicly available in CSV format via Zenodo.
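
The abstract does not state which similarity measures were used, so the following is only a minimal sketch of how title and author similarity between a preprint and a candidate journal article might be computed. The function names, the character-based ratio from Python's standard `difflib`, the Jaccard overlap of surnames, and the assumption that names are written as "First Last" are all illustrative choices, not the dataset's documented method.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def title_similarity(preprint_title: str, journal_title: str) -> float:
    """Character-level similarity in [0, 1] (illustrative metric, an assumption)."""
    a, b = normalize(preprint_title), normalize(journal_title)
    return SequenceMatcher(None, a, b).ratio()


def author_similarity(preprint_authors: list[str], journal_authors: list[str]) -> float:
    """Jaccard overlap of surnames, assuming names are given as 'First Last'."""
    def surnames(names: list[str]) -> set[str]:
        # Take the last token of each non-empty normalized name as the surname.
        return {normalize(n).split()[-1] for n in names if normalize(n)}

    a, b = surnames(preprint_authors), surnames(journal_authors)
    if not a and not b:
        return 0.0  # no author information on either side
    return len(a & b) / len(a | b)
```

Scores like these could help flag Gray Zone candidates for manual review; any threshold applied to them would be a further assumption and should be checked against the human-annotated subset.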
