In clinical informatics, software tools developed with government funding are an important resource for both research and application. These tools, however, are scattered across many repositories with no centralized knowledge base, which makes it difficult to realize their full potential. This study introduces an automated methodology that addresses this gap by systematically extracting GitHub repository URLs from clinical informatics papers indexed in arXiv. The approach consists of four steps: querying the arXiv API for relevant papers, cleaning the extracted GitHub URLs, fetching repository information via the GitHub API, and analyzing repository maturity using defined metrics such as stars, forks, open issues, and contributors. The process incorporates error handling and rate limiting to remain within API constraints. Preliminary findings indicate that the methodology is effective in compiling a centralized knowledge base of NIH-funded software tools, laying the groundwork for better understanding and use of these resources within the clinical informatics community. We propose the future integration of Large Language Models (LLMs) to generate concise summaries and evaluations of the tools. Beyond facilitating the discovery and assessment of clinical informatics tools, the approach enables ongoing monitoring of new and actively updated repositories, which could substantially change how researchers access and use federally funded software. The implications extend beyond simplifying access to valuable resources: the study proposes a scalable model for the dynamic aggregation and evaluation of scientific software, encouraging more collaborative, transparent, and efficient research practices in clinical informatics and beyond.
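The URL-cleaning and maturity-analysis steps named in the abstract can be illustrated with a minimal offline sketch. It is not the study's implementation: the regular expression, the function names `clean_github_urls` and `maturity_score`, and the log-damped weighting are all assumptions made for illustration, and the abstract does not state how the four metrics are combined.

```python
import math
import re

# Assumed pattern: github.com/<owner>/<repo>, with a simplified character set.
GITHUB_URL = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def clean_github_urls(text):
    """Return canonical, de-duplicated repository URLs in order of appearance."""
    cleaned = []
    seen = set()
    for owner, repo in GITHUB_URL.findall(text):
        repo = repo.rstrip(".")      # drop trailing sentence punctuation
        if repo.endswith(".git"):    # drop a clone-style suffix
            repo = repo[:-4]
        if not repo:
            continue
        url = f"https://github.com/{owner}/{repo}"
        key = url.lower()            # GitHub owner/repo names are case-insensitive
        if key not in seen:
            seen.add(key)
            cleaned.append(url)
    return cleaned


def maturity_score(stars, forks, open_issues, contributors):
    """Hypothetical score: log-damped sum of the four metrics in the abstract.

    The weights are illustrative only; open issues get a lower weight here
    because they can signal either activity or neglect.
    """
    return (
        math.log1p(stars)
        + math.log1p(forks)
        + math.log1p(contributors)
        + 0.5 * math.log1p(open_issues)
    )


if __name__ == "__main__":
    sample = (
        "Code: https://github.com/foo/bar. See also github.com/foo/bar.git "
        "and (http://github.com/Baz/qux)."
    )
    print(clean_github_urls(sample))
    print(round(maturity_score(stars=100, forks=10, open_issues=5, contributors=3), 2))
```

In a full pipeline, the inputs to `maturity_score` would come from the GitHub API rather than from literals, with the error handling and rate limiting that the abstract describes wrapped around each request.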