We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides infrastructure for large-scale analysis of software within the broader scientific research ecosystem.