Software tools developed with government funding are an important resource for clinical informatics research and practice. These tools, however, are scattered across many repositories with no centralized knowledge base, which makes them hard to discover and limits their use. This study introduces an automated methodology that addresses this gap by systematically extracting GitHub repository URLs from clinical informatics papers indexed in arXiv. The approach has four steps: querying the arXiv API for relevant papers, cleaning the extracted GitHub URLs, fetching repository information via the GitHub API, and analyzing repository maturity using defined metrics such as stars, forks, open issues, and contributors. The pipeline incorporates error handling and rate limiting to stay within API constraints. Preliminary findings indicate that the methodology can compile a centralized knowledge base of NIH-funded software tools, laying the groundwork for better understanding and use of these resources within the clinical informatics community. We propose future integration of Large Language Models (LLMs) to generate concise summaries and evaluations of the tools. Beyond supporting discovery and assessment of clinical informatics tools, the approach enables ongoing monitoring of new and actively updated repositories, which could change how researchers access and use federally funded software. More broadly, the study proposes a scalable model for the dynamic aggregation and evaluation of scientific software, encouraging more collaborative, transparent, and efficient research practices in clinical informatics and beyond.
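The pipeline described above (query arXiv, clean the extracted GitHub URLs, fetch repository metadata, score maturity) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the URL regular expression, the fixed-delay rate limiting, and especially the maturity weights are hypothetical. Only the arXiv query endpoint and the GitHub REST fields `stargazers_count`, `forks_count`, and `open_issues_count` are taken from the public APIs.

```python
import json
import re
import time
import urllib.error
import urllib.parse
import urllib.request

ARXIV_API = "http://export.arxiv.org/api/query"
GITHUB_API = "https://api.github.com/repos"

# Approximate pattern for github.com/<owner>/<repo> (an assumption, not exhaustive).
GITHUB_URL_RE = re.compile(
    r"github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)", re.IGNORECASE
)


def arxiv_query_url(terms, max_results=50):
    """Build an arXiv API query URL for a free-text search."""
    params = {
        "search_query": f'all:"{terms}"',
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def clean_github_urls(text):
    """Extract GitHub repository URLs from text, normalize, and deduplicate."""
    seen, cleaned = set(), []
    for owner, repo in GITHUB_URL_RE.findall(text):
        repo = repo.rstrip(".")  # drop sentence-ending punctuation
        if repo.lower().endswith(".git"):
            repo = repo[:-4]  # drop clone-style suffix
        key = (owner.lower(), repo.lower())
        if repo and key not in seen:
            seen.add(key)
            cleaned.append(f"https://github.com/{owner}/{repo}")
    return cleaned


def fetch_repo(owner, repo, token=None, pause=1.0):
    """Fetch repository metadata from the GitHub REST API; None on failure."""
    req = urllib.request.Request(
        f"{GITHUB_API}/{owner}/{repo}",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp)
    except (OSError, ValueError):  # network/HTTP errors or malformed JSON
        return None
    finally:
        time.sleep(pause)  # crude fixed delay standing in for rate limiting


def maturity_score(meta, contributors=0):
    """Toy maturity score; the weights are illustrative assumptions."""
    stars = meta.get("stargazers_count", 0)
    forks = meta.get("forks_count", 0)
    issues = meta.get("open_issues_count", 0)
    return stars + 2 * forks + 3 * contributors - 0.5 * issues
```

The contributor count is passed in separately because GitHub exposes contributors through a different endpoint than the main repository record. A production version would likely read the rate-limit headers returned by the API rather than sleeping for a fixed interval.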