The proliferation of open source software (OSS) and the many forms of code reuse have made accurate license identification, an essential legal and compliance task in the software supply chain, extremely difficult. This study presents a reusable and comprehensive dataset of OSS licenses built on the World of Code (WoC) infrastructure. We scanned all files whose paths contain "license" and applied approximate matching with the winnowing algorithm to find the most similar license on the SPDX list, identifying 5.5 million distinct license blobs in OSS projects. The dataset includes a detailed project-to-license (P2L) map with commit timestamps, enabling dynamic analysis of license adoption and change over time. To verify the dataset's accuracy, we used stratified sampling and manual review, obtaining an accuracy of 92.08%, with a precision of 87.14%, a recall of 95.45%, and an F1 score of 91.11%. The dataset is intended to support a range of research and practical tasks, including detecting license noncompliance, investigating license changes, studying licensing trends, and developing compliance tools. It is openly available, providing a valuable resource for developers, researchers, and legal professionals in the OSS community.
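To make the matching step concrete, the following is a minimal sketch of winnowing-based approximate matching, not the pipeline used to build the dataset. The k-gram length `k`, window size `w`, the MD5-based hash, the Jaccard similarity score, and the `references` dictionary standing in for the SPDX list are all assumptions made for illustration; the abstract does not specify any of them.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-character substring of the normalized text."""
    s = normalize(text)
    return [
        int.from_bytes(hashlib.md5(s[i:i + k].encode()).digest()[:4], "big")
        for i in range(len(s) - k + 1)
    ]


def winnow(text: str, k: int = 5, w: int = 4) -> set[int]:
    """Fingerprint: the minimum hash of each window of w consecutive k-gram hashes.

    k and w are illustrative defaults (assumption), not values from the study.
    """
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        # Text too short for a full window: keep its single smallest hash.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity of two fingerprint sets (assumed scoring function)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(candidate: str, references: dict[str, str]) -> tuple[str, float]:
    """Return the reference name whose fingerprint is most similar to the candidate."""
    fp = winnow(candidate)
    scores = {name: jaccard(fp, winnow(text)) for name, text in references.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical, heavily shortened stand-ins for SPDX reference texts.
references = {
    "MIT": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software",
    "Apache-2.0": "Licensed under the Apache License, Version 2.0; you may not "
                  "use this file except in compliance with the License",
}

# A license blob with a copyright line and different whitespace and casing.
candidate = (
    "Copyright (c) 2020 Example Author.\n\n"
    "Permission is hereby granted,  free of charge, to any person\n"
    "obtaining a copy of this Software"
)

print(best_match(candidate, references))
```

Because each fingerprint keeps only the minimum hash per window, two texts that share a sufficiently long run of normalized characters are guaranteed to share at least one fingerprint, which is what makes the comparison tolerant of added copyright lines and formatting differences.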