OSS License Identification at Scale: A Comprehensive Dataset Using World of Code

The proliferation of open source software (OSS) and the many forms of code reuse have made accurate license identification, an essential legal and compliance task within the software supply chain, very difficult to perform. This study presents a reusable and comprehensive dataset of OSS licenses, created using the World of Code (WoC) infrastructure. By scanning all files containing "license" in their file paths and applying approximate matching via the winnowing algorithm to identify the most similar license from the SPDX list, we found and identified 5.5 million distinct license blobs in OSS projects. The dataset includes a detailed project-to-license (P2L) map with commit timestamps, enabling dynamic analysis of license adoption and changes over time. To verify the accuracy of the dataset we used stratified sampling and manual review, achieving a final accuracy of 92.08%, with a precision of 87.14%, a recall of 95.45%, and an F1 score of 91.11%. This dataset is intended to support a range of research and practical tasks, including the detection of license noncompliance, the investigation of license changes, the study of licensing trends, and the development of compliance tools. The dataset is open, providing a valuable resource for developers, researchers, and legal professionals in the OSS community.
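The approximate-matching step can be illustrated with a minimal sketch of winnowing-style fingerprinting. This is not the authors' implementation: the tokenization, the k-gram size `k`, the window size `window`, the use of Jaccard similarity as the score, and the helper names (`normalize`, `winnow`, `best_match`) are all assumptions chosen for illustration, and the reference texts in the usage note are toy fragments rather than full SPDX license texts.

```python
import hashlib
import re


def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed scheme)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens to a 64-bit integer."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(text, k=5, window=4):
    """Return a fingerprint set: the minimum hash of each sliding window.

    Simplified variant: positions are discarded and only hash values are kept.
    """
    hashes = kgram_hashes(normalize(text), k)
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(blob_text, references, k=5, window=4):
    """Return (score, name) of the reference most similar to blob_text.

    `references` maps a hypothetical license name to its reference text.
    """
    blob_fp = winnow(blob_text, k, window)
    scored = [
        (jaccard(blob_fp, winnow(ref_text, k, window)), name)
        for name, ref_text in references.items()
    ]
    return max(scored) if scored else (0.0, None)
```

For example, calling `best_match(blob, {"toy-a": text_a, "toy-b": text_b})` on a blob that extends `text_a` with a few extra words should select `"toy-a"`, because the blob shares most of its window minima with that reference. In practice a similarity threshold would likely be needed so that files unrelated to any known license are not forced onto the nearest one.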

