Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses

The widespread adoption of third-party libraries (TPLs) has accelerated modern software development, but this convenience carries legal risk: developers may inadvertently violate the licenses of the TPLs they use. Existing studies have explored software licenses and their potential incompatibilities, but they often cover only a limited set of licenses or rely on low-quality license data, which may affect their conclusions. A high-quality dataset covering a broad range of mainstream licenses is therefore needed to help developers navigate the license landscape, avoid legal pitfalls, and guide solutions for managing license compliance and compatibility. To this end, we conduct the first study of mainstream software licenses at term granularity and obtain a high-quality dataset of 453 SPDX licenses with well-labeled terms and conflicts. Specifically, we first conduct a differential analysis of mainstream platforms to understand the terms and attitudes of each license. Next, we propose a standardized set of license terms to capture and label existing mainstream licenses with high quality. We also cover copyleft conflicts and identify three major types of license conflicts among the 453 SPDX licenses. Based on these, we carry out two empirical studies that reveal concerns and threats from the perspectives of both licensors and licensees. One study provides an in-depth analysis of the similarities, differences, and conflicts among SPDX licenses; the other revisits the usage and conflicts of licenses in the NPM ecosystem and draws conclusions that differ from previous work. Our studies yield new findings and release the related analytical data, which set the stage for further research.
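
To make the idea of term-granularity labeling concrete, the following is a minimal sketch, not the dataset's actual schema. It assumes each license is labeled as a mapping from a term to an attitude, and that a conflict arises when one license forbids what the other requires. The attitude values (`CAN`, `CANNOT`, `MUST`), the term names, and the two example licenses are all hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Dict, List


class Attitude(Enum):
    """Hypothetical attitude labels a license may take toward a term."""
    CAN = "can"
    CANNOT = "cannot"
    MUST = "must"


# A labeled license: term name -> attitude (assumed representation).
LicenseTerms = Dict[str, Attitude]


def find_term_conflicts(first: LicenseTerms, second: LicenseTerms) -> List[str]:
    """Return the terms that one license forbids while the other requires."""
    conflicts = []
    for term, attitude in first.items():
        other = second.get(term)
        if other is None:
            continue  # the second license is silent on this term
        if {attitude, other} == {Attitude.CANNOT, Attitude.MUST}:
            conflicts.append(term)
    return sorted(conflicts)


# Two made-up licenses, not taken from the SPDX list.
license_a: LicenseTerms = {
    "disclose_source": Attitude.MUST,
    "commercial_use": Attitude.CAN,
    "sublicense": Attitude.CANNOT,
}
license_b: LicenseTerms = {
    "disclose_source": Attitude.CANNOT,
    "commercial_use": Attitude.CAN,
}

conflicting_terms = find_term_conflicts(license_a, license_b)
print(conflicting_terms)
```

This check covers only a direct forbid-versus-require clash on a single term. The three conflict types identified in the paper, including copyleft conflicts, are not reproduced here.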
