PreprintResolver: Improving Citation Quality by Resolving Published Versions of ArXiv Preprints using Literature Databases

The growing impact of preprint servers enables the rapid sharing of time-sensitive research. At the same time, it is becoming increasingly difficult to distinguish high-quality, peer-reviewed research from preprints. Although preprints are often later published in peer-reviewed journals, this information is frequently missing from preprint servers. To overcome this problem, the PreprintResolver was developed, which uses four literature databases (DBLP, SemanticScholar, OpenAlex, and CrossRef / CrossCite) to identify preprint-publication pairs for the arXiv preprint server. The target audience includes, but is not limited to, inexperienced researchers and students, especially in the field of computer science. The tool is based on fuzzy matching of author surnames, titles, and DOIs. Computer science is highly affected by missing publication information in arXiv, at 77.94 %. Experiments were therefore performed on a sample of 1,000 arXiv preprints from the research field of computer science without any publication information. The results show that the PreprintResolver was able to resolve 603 of these 1,000 preprints (60.3 %). All four literature databases contributed to the final result. In a manual validation, a random sample of 100 resolved preprints was checked. For all of these preprints, at least one result is plausible. For nine preprints, more than one result was identified, three of which are partially invalid. In conclusion, the PreprintResolver is suitable for individual, manually reviewed requests, but less suitable for bulk requests. The PreprintResolver tool (https://preprintresolver.eu, Available from 2023-08-01) and source code (https://gitlab.com/ippolis_wp3/preprint-resolver, Accessed: 2023-07-19) are available online.
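
To illustrate the kind of fuzzy matching on author surnames, titles, and DOIs described above, the following is a minimal sketch and not the PreprintResolver implementation. The function names, the record layout (`title`, `surnames`, `doi`), the use of Python's `difflib` for title similarity, the Jaccard overlap for surnames, and both thresholds are assumptions chosen for illustration only.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.lower())
    return " ".join(cleaned.split())


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def surname_overlap(preprint_surnames, candidate_surnames):
    """Jaccard overlap of the two normalized surname sets."""
    a = {normalize(s) for s in preprint_surnames}
    b = {normalize(s) for s in candidate_surnames}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_plausible_match(preprint, candidate,
                       title_threshold=0.9, author_threshold=0.5):
    """Decide whether a database record plausibly matches a preprint.

    Both thresholds are hypothetical values, not taken from the paper.
    """
    # Assumption: identical DOIs (compared case-insensitively) settle the match.
    p_doi, c_doi = preprint.get("doi"), candidate.get("doi")
    if p_doi and c_doi and p_doi.lower() == c_doi.lower():
        return True
    # Otherwise require both a similar title and overlapping author surnames.
    return (
        title_similarity(preprint["title"], candidate["title"]) >= title_threshold
        and surname_overlap(preprint["surnames"], candidate["surnames"]) >= author_threshold
    )


# Hypothetical records, invented for this example.
preprint = {"title": "A Hypothetical Study of Graph Matching",
            "surnames": ["Müller", "Rossi"], "doi": None}
candidate = {"title": "A hypothetical study of graph matching.",
             "surnames": ["Muller", "Rossi", "Tanaka"], "doi": "10.0000/example"}
unrelated = {"title": "Completely Different Work",
             "surnames": ["Nguyen"], "doi": None}
```

With these records, `is_plausible_match(preprint, candidate)` accepts the pair despite differences in casing, punctuation, diacritics, and one additional author, while `is_plausible_match(preprint, unrelated)` rejects it. Because a sketch like this can return more than one plausible candidate per preprint, manually reviewing each result remains advisable, in line with the validation findings reported above.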
