Assessing the impact of Open Research Information Infrastructures using NLP driven full-text Scientometrics: A case study of the LXCat open-access platform


Kalp Pandya, Khushi Shah, Nirmal Shah, Nakshi Shah, Bhaskar Chaudhury

Open research information (ORI) infrastructures play a central role in shaping how scientific knowledge is produced, disseminated, validated, and reused across the research lifecycle. The visibility of such infrastructures is often assessed through citation-based metrics. In this study, we present a full-text, natural language processing (NLP)-driven scientometric framework that systematically quantifies the impact of ORI infrastructures beyond citation counts, using the LXCat platform for low-temperature plasma (LTP) research as a representative case study. The modeling of LTPs and the interpretation of LTP experiments rely heavily on accurate data, much of which is hosted on LXCat, a community-driven, open-access platform central to the LTP research ecosystem. To investigate the scholarly impact of LXCat over the past decade, we analyzed a curated corpus of full-text research articles citing three foundational LXCat publications. We present a comprehensive pipeline that integrates chemical entity recognition, dataset and solver mention extraction, affiliation-based geographic mapping, and topic modeling. The pipeline extracts fine-grained patterns of data usage that reflect implicit research priorities, data practices, differential reliance on specific databases, evolving modes of data reuse and coupling within scientific workflows, and thematic evolution. Importantly, the proposed methodology is domain-agnostic and transferable to other ORI contexts. It highlights the utility of NLP in quantifying the role of scientific data infrastructures and offers a data-driven reflection on how open-access platforms like LXCat contribute to shaping research directions. This work presents a scalable scientometric framework that has the potential to support evidence-based evaluation of ORI platforms and to inform infrastructure design, governance, sustainability, and policy for future development.
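
To make the dataset and solver mention extraction step concrete, the following is a minimal sketch of how mentions might be counted across a full-text corpus. It is not the authors' implementation: the abstract does not specify the matching method or the vocabularies, so the pattern dictionaries, the function name `count_mentions`, and the choice of simple regular-expression matching are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical vocabularies for illustration only; the abstract does not list
# the database or solver names the actual pipeline searches for.
DATABASE_PATTERNS = {
    "Phelps": r"\bPhelps\b",
    "Morgan": r"\bMorgan\b",
    "IST-Lisbon": r"\bIST[- ]Lisbon\b",
}
SOLVER_PATTERNS = {
    # No trailing \b: "+" is a non-word character, so \b would not match
    # between "+" and a following space or period.
    "BOLSIG+": r"\bBOLSIG\+",
    "ZDPlasKin": r"\bZDPlasKin\b",
}


def count_mentions(texts, patterns):
    """Count how many documents mention each term at least once."""
    compiled = {name: re.compile(p, re.IGNORECASE) for name, p in patterns.items()}
    counts = Counter()
    for text in texts:
        for name, regex in compiled.items():
            if regex.search(text):
                counts[name] += 1  # document-level count, not occurrence count
    return counts


if __name__ == "__main__":
    # Toy stand-ins for full-text articles.
    corpus = [
        "Cross sections were taken from the Phelps database and solved with BOLSIG+.",
        "We used the IST-Lisbon set on LXCat together with Bolsig+.",
        "No data source is mentioned in this article.",
    ]
    print(count_mentions(corpus, DATABASE_PATTERNS))
    print(count_mentions(corpus, SOLVER_PATTERNS))
```

Counting at the document level rather than per occurrence keeps a single article that repeats a database name many times from dominating the totals; a real pipeline would likely also need alias handling and disambiguation (for example, separating an author surname from a database name), which this sketch does not attempt.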
