Software Heritage is the largest public archive of software source code and associated development history, as captured by modern version control systems. As of July 2023, it has archived more than 16 billion unique source code files coming from more than 250 million collaborative development projects. In this chapter, we describe the Software Heritage ecosystem, focusing on research and open science use cases. On the one hand, Software Heritage supports empirical research on software by materializing the development history of public code in a single Merkle directed acyclic graph. This giant graph of source code artifacts (files, directories, and commits) can be used, and has been used, to study repository forks, open source contributors, vulnerability propagation, software provenance tracking, source code indexing, and more. On the other hand, Software Heritage ensures availability and guarantees integrity of the source code of software artifacts used in any field that relies on software to conduct experiments, contributing to making research reproducible. The source code used in scientific experiments can be archived (e.g., via integration with open-access repositories), referenced using persistent identifiers that allow downstream integrity checks, and linked to/from other scholarly digital artifacts.
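To illustrate how a persistent identifier can support a downstream integrity check, the following is a minimal sketch rather than the archive's own tooling. It assumes the SWHID scheme for file contents, which is not spelled out in the text above: the prefix `swh:1:cnt:` followed by the Git-compatible SHA-1 of the file (the hash of `blob <length>\0` plus the raw bytes). The function names `swhid_for_content` and `verify_content` are hypothetical, chosen only for this example.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Compute a content identifier from raw file bytes.

    Assumption: the identifier is the Git blob hash of the data,
    i.e. SHA-1 over the header b"blob <length>\\0" followed by the bytes.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def verify_content(data: bytes, swhid: str) -> bool:
    """Return True if the bytes in hand match the referenced identifier."""
    return swhid_for_content(data) == swhid


if __name__ == "__main__":
    source = b"hello world\n"
    identifier = swhid_for_content(source)
    print(identifier)
    # An unmodified copy verifies; any change to the bytes does not.
    print(verify_content(source, identifier))
    print(verify_content(source + b"tampered", identifier))
```

Because the identifier is derived from the content itself, anyone holding a copy of the file can recompute it and compare, without trusting the party that supplied the copy. The same principle applies to the nodes of a Merkle graph, where each directory or commit identifier depends on the identifiers of the objects it references.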