Considerable scientific work involves locating, analyzing, systematizing, and synthesizing other publications. The results appear in a paper's "background" section or in standalone articles such as meta-analyses and systematic literature reviews. This research is aided by online scientific publication databases and search engines, such as Web of Science, Scopus, and Google Scholar. However, the use of online databases suffers from a lack of repeatability and transparency, as well as from technical restrictions. Thankfully, open data, powerful personal computers, and open source software now make it possible to run sophisticated publication studies on the desktop in a self-contained environment that peers can readily reproduce. Here we report a Python software package and an associated command-line tool that can populate embedded relational databases with slices from the complete set of Crossref publication metadata, ORCID author records, and other open data sets, for in-depth processing through performant queries. We demonstrate the software's utility by analyzing the underlying dataset's contents, by visualizing the evolution of publications in diverse scientific fields and the relationships between them, by outlining scientometric facts associated with COVID-19 research, and by replicating commonly used bibliometric measures of productivity and impact.
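The general idea of populating an embedded relational database with a slice of publication metadata and then querying it locally can be sketched as follows. This is a minimal illustration using Python's standard `sqlite3` module, not the reported package's interface: the `works` table, its columns, the `populate` and `works_per_year` helpers, and the three sample records are all hypothetical and hand-made for the example.

```python
import sqlite3

# Hypothetical, hand-made records standing in for publication metadata.
# The field names are assumptions, not the schema used by the package.
SAMPLE_WORKS = [
    {"doi": "10.0000/example.1", "year": 2019, "title": "Toy paper A"},
    {"doi": "10.0000/example.2", "year": 2020, "title": "Toy paper B"},
    {"doi": "10.0000/example.3", "year": 2020, "title": "Toy paper C"},
]


def populate(conn, works, min_year):
    """Store only the works published in or after min_year (the slice)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS works "
        "(doi TEXT PRIMARY KEY, year INTEGER, title TEXT)"
    )
    rows = [
        (w["doi"], w["year"], w["title"])
        for w in works
        if w["year"] >= min_year
    ]
    conn.executemany("INSERT INTO works VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def works_per_year(conn):
    """Count the stored works for each publication year."""
    return conn.execute(
        "SELECT year, COUNT(*) FROM works GROUP BY year ORDER BY year"
    ).fetchall()


# An in-memory database keeps the sketch self-contained; a file path
# would make the populated database persistent and shareable.
conn = sqlite3.connect(":memory:")
inserted = populate(conn, SAMPLE_WORKS, min_year=2020)
counts = works_per_year(conn)
print(inserted, counts)
```

Because the database is an ordinary file or in-memory object, the same population step and queries can be rerun by anyone holding the same input data, which is the reproducibility property the abstract emphasizes.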