Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI, signaling that artifacts such as models, datasets, and code can be freely used, modified, and redistributed. However, these licenses carry mandatory requirements (including the full license text, providing a copyright notice, and preserving upstream attribution) that remain unverified at scale. Failure to meet these conditions can place reuse outside the scope of the license, effectively leaving AI artifacts under default copyright for those uses and exposing downstream users to litigation. We call this phenomenon ``permissive washing'': labeling AI artifacts as free to use while omitting the legal documentation required to make that label actionable. To assess how widespread permissive washing is in the AI supply chain, we empirically audit 124,278 dataset $\rightarrow$ model $\rightarrow$ application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub. We find that 96.5\% of datasets and 95.8\% of models lack the required license text, and that only 2.3\% of datasets and 3.2\% of models satisfy both the license-text and copyright requirements. Even when upstream artifacts provide complete licensing evidence, attribution rarely propagates downstream: only 27.59\% of models preserve compliant dataset notices and only 5.75\% of applications preserve compliant model notices (with just 6.38\% preserving any linked upstream notice). Practitioners cannot assume that permissive labels confer the rights they claim: license files and notices, not metadata, are the source of legal truth. To support future research, we release our full audit dataset and reproducible pipeline.
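The core check behind the audit can be illustrated with a minimal sketch. This is not the paper's released pipeline; it is a hypothetical helper that flags permissive washing by comparing a repository's metadata license tag against the files the repository actually ships (the file names, tag values, and function name are illustrative assumptions):

```python
# Hypothetical sketch (not the paper's pipeline): a repo is "permissive
# washed" when its metadata claims a permissive license but it ships no
# license text, so the license's mandatory conditions cannot be met.

PERMISSIVE_TAGS = {"mit", "apache-2.0", "bsd-3-clause"}
LICENSE_FILES = {"license", "license.txt", "license.md", "copying", "notice"}

def is_permissive_washed(license_tag: str, repo_files: list[str]) -> bool:
    """Return True if a permissive license is claimed in metadata
    but no recognizable license file is present in the repo listing."""
    if (license_tag or "").lower() not in PERMISSIVE_TAGS:
        return False  # not claiming a permissive license, out of scope
    # Compare base file names case-insensitively against known license names
    names = {f.rsplit("/", 1)[-1].lower() for f in repo_files}
    return names.isdisjoint(LICENSE_FILES)
```

For example, a model repo tagged `mit` that contains only `config.json` and `model.safetensors` would be flagged, while one that also ships a `LICENSE` file would not.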