Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI, signaling that artifacts such as models, datasets, and code can be freely used, modified, and redistributed. However, these licenses carry mandatory requirements (including the full license text, providing a copyright notice, and preserving upstream attribution) that remain unverified at scale. Failure to meet these conditions can place reuse outside the scope of the license, effectively leaving AI artifacts under default copyright for those uses and exposing downstream users to litigation. We call this phenomenon ``permissive washing'': labeling AI artifacts as free to use while omitting the legal documentation required to make that label actionable. To assess how widespread permissive washing is in the AI supply chain, we empirically audit 124,278 dataset $\rightarrow$ model $\rightarrow$ application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub. We find that 96.5\% of datasets and 95.8\% of models lack the required license text, and that only 2.3\% of datasets and 3.2\% of models satisfy both the license-text and copyright requirements. Even when upstream artifacts provide complete licensing evidence, attribution rarely propagates downstream: only 27.59\% of models preserve compliant dataset notices and only 5.75\% of applications preserve compliant model notices (with just 6.38\% preserving any linked upstream notice). Practitioners cannot assume that permissive labels confer the rights they claim: license files and notices, not metadata, are the source of legal truth. To support future research, we release our full audit dataset and reproducible pipeline.
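The core check behind the audit can be illustrated with a minimal sketch. This is not the paper's released pipeline; it is a hypothetical helper that flags permissive washing by comparing a repository's metadata license tag against the files the repository actually ships (the file names, tag values, and function name are illustrative assumptions):

```python
# Hypothetical sketch (not the paper's pipeline): a repo is "permissive
# washed" when its metadata claims a permissive license but it ships no
# license text, so the license's mandatory conditions cannot be met.

PERMISSIVE_TAGS = {"mit", "apache-2.0", "bsd-3-clause"}
LICENSE_FILES = {"license", "license.txt", "license.md", "copying", "notice"}

def is_permissive_washed(license_tag: str, repo_files: list[str]) -> bool:
    """Return True if a permissive license is claimed in metadata
    but no recognizable license file is present in the repo listing."""
    if (license_tag or "").lower() not in PERMISSIVE_TAGS:
        return False  # not claiming a permissive license, out of scope
    # Compare base file names case-insensitively against known license names
    names = {f.rsplit("/", 1)[-1].lower() for f in repo_files}
    return names.isdisjoint(LICENSE_FILES)
```

For example, a model repo tagged `mit` that contains only `config.json` and `model.safetensors` would be flagged, while one that also ships a `LICENSE` file would not.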