Automated Extraction and Maturity Analysis of Open Source Clinical Informatics Repositories from Scientific Literature

In clinical informatics, software tools developed with government funding are an important resource for both research and application. However, these tools are scattered across many repositories with no centralized knowledge base, which makes it hard to realize their full potential. This study introduces an automated methodology that addresses this gap by systematically extracting GitHub repository URLs from clinical informatics papers indexed in arXiv. Our approach comprises four steps: querying the arXiv API for relevant papers, cleaning the extracted GitHub URLs, fetching repository information via the GitHub API, and analyzing repository maturity using defined metrics such as stars, forks, open issues, and contributors. The process incorporates error handling and rate limiting to stay within API constraints. Preliminary findings indicate that the methodology is effective at compiling a centralized knowledge base of NIH-funded software tools, laying the groundwork for better understanding and use of these resources within the clinical informatics community. As future work, we propose integrating Large Language Models (LLMs) to generate concise summaries and evaluations of the tools. The approach facilitates the discovery and assessment of clinical informatics tools and enables ongoing monitoring of new and actively updated repositories, which could substantially change how researchers access and use federally funded software. Beyond simplifying access to these resources, the study proposes a scalable model for the dynamic aggregation and evaluation of scientific software, encouraging more collaborative, transparent, and efficient research practices in clinical informatics and beyond.
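The URL-cleaning and maturity-analysis steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `extract_repos`, `metrics_from_api`, and `maturity_score` are hypothetical, the regular expression is one plausible cleaning rule, and the equal-weight, log-damped score is an assumption, since the abstract lists the metrics but not how they are combined. The JSON field names (`stargazers_count`, `forks_count`, `open_issues_count`) are those returned by the GitHub REST API for a repository; the contributor count is assumed to be obtained separately.

```python
import math
import re

# Assumed pattern for github.com/<owner>/<repo>; anything after the repo name is ignored.
GITHUB_REPO = re.compile(r"github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE)


def extract_repos(text):
    """Return unique 'owner/repo' slugs found in free text, in order of appearance."""
    seen, repos = set(), []
    for owner, repo in GITHUB_REPO.findall(text):
        repo = repo.rstrip(".")              # drop trailing sentence punctuation
        if repo.lower().endswith(".git"):    # drop a clone-URL suffix
            repo = repo[:-4]
        slug = f"{owner}/{repo}"
        if repo and slug.lower() not in seen:
            seen.add(slug.lower())
            repos.append(slug)
    return repos


def metrics_from_api(repo_json, contributor_count):
    """Pick the four metrics named in the abstract out of a GitHub repository record."""
    return {
        "stars": repo_json.get("stargazers_count", 0),
        "forks": repo_json.get("forks_count", 0),
        "open_issues": repo_json.get("open_issues_count", 0),
        "contributors": contributor_count,
    }


def maturity_score(metrics):
    """Hypothetical composite: sum of log(1 + value) over the four metrics."""
    return round(sum(math.log1p(max(v, 0)) for v in metrics.values()), 3)


if __name__ == "__main__":
    sample = (
        "Code: https://github.com/example-lab/ehr-tools. "
        "Mirror: http://GitHub.com/example-lab/ehr-tools.git, "
        "see also https://github.com/other/repo/issues/3)."
    )
    print(extract_repos(sample))
    record = {"stargazers_count": 120, "forks_count": 15, "open_issues_count": 4}
    print(maturity_score(metrics_from_api(record, contributor_count=6)))
```

The logarithm keeps a single very large count (for example, stars) from dominating the score; a real analysis might weight the metrics differently or treat open issues as a negative signal.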

