Preprints play an increasingly critical role in academic communities. Researchers have many reasons to post manuscripts to preprint servers before formally submitting them to journals or conferences, but the use of preprints has also sparked considerable controversy, especially over claims of priority. In this paper, we conduct a case study of computer science preprints submitted to arXiv from 2008 to 2017 to quantify how many were eventually published in peer-reviewed venues. Some of the published manuscripts appear under different titles, and their arXiv preprints were never updated. For these manuscripts, traditional fuzzy matching cannot map the preprint to its final published version. To address this issue, we introduce a semantics-based mapping method that uses Bidirectional Encoder Representations from Transformers (BERT). With this new mapping method and multiple data sources, we find that 66% of all sampled preprints are published under unchanged titles and 11% are published under different titles and with other modifications. We then analyze why these preprints, but not others, were accepted for publication. Our comparison reveals that, in computer science, published preprints feature adequate revisions, multiple authorship, detailed abstracts and introductions, extensive and authoritative references, and available source code.
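To make the limitation of fuzzy title matching concrete, the following is a minimal sketch, not the procedure used in this study. It assumes character-level similarity from Python's standard `difflib` module; the function names, the 0.9 threshold, and the example titles are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase the title and replace punctuation with single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(kept.split())


def fuzzy_score(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(preprint_title, candidates, threshold=0.9):
    """Return (title, score) for the closest candidate.

    The title is None when no candidate reaches the threshold.
    The threshold of 0.9 is an illustrative assumption.
    """
    scored = [(fuzzy_score(preprint_title, c), c) for c in candidates]
    score, title = max(scored, default=(0.0, None))
    return (title, score) if score >= threshold else (None, score)


# Hypothetical titles, invented for this example only.
preprint = "Deep Learning for Image Retrieval: A Survey"
same_title = "Deep learning for image retrieval - a survey"
retitled = "A Comprehensive Review of Neural Approaches to Visual Search"

print(best_match(preprint, [same_title]))  # matched: only case and punctuation differ
print(best_match(preprint, [retitled]))    # not matched: the wording changed
```

Under these assumptions, the retitled candidate is rejected even though it could describe the same work, because the score depends only on shared characters. A semantics-based variant would replace `fuzzy_score` with a similarity computed over sentence representations, such as those produced by a BERT encoder.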