Considerable scientific work involves locating, analyzing, systematizing, and synthesizing other publications. Its results end up in a paper's "background" section or in standalone articles, including meta-analyses and systematic literature reviews. This research is aided by online scientific publication databases and search engines, such as Web of Science, Scopus, and Google Scholar. However, the use of online databases suffers from a lack of repeatability and transparency, as well as from technical restrictions. Thankfully, open data, powerful personal computers, and open source software now make it possible to run sophisticated publication studies on the desktop in a self-contained environment that peers can readily reproduce. Here we report a Python software package and an associated command-line tool that can populate embedded relational databases with slices from the complete set of Crossref publication metadata, ORCID author records, and other open data sets, for in-depth processing through performant queries. We demonstrate the software's utility by analyzing the underlying dataset's contents, by visualizing the evolution of publications in diverse scientific fields and the relationships among them, by outlining scientometric facts associated with COVID-19 research, and by replicating commonly used bibliometric measures of productivity, impact, and disruption.