As model parameter sizes reach the billion-level range and their training consumes zettaFLOPs of computation, components reuse and collaborative development are become increasingly prevalent in the Machine Learning (ML) community. These components, including models, software, and datasets, may originate from various sources and be published under different licenses, which govern the use and distribution of licensed works and their derivatives. However, commonly chosen licenses, such as GPL and Apache, are software-specific and are not clearly defined or bounded in the context of model publishing. Meanwhile, the reused components may also have free-content licenses and model licenses, which pose a potential risk of license noncompliance and rights infringement within the model production workflow. In this paper, we propose addressing the above challenges along two lines: 1) For license analysis, we have developed a new vocabulary for ML workflow management and encoded license rules to enable ontological reasoning for analyzing rights granting and compliance issues. 2) For standardized model publishing, we have drafted a set of model licenses that provide flexible options to meet the diverse needs of model publishing. Our analysis tool is built on Turtle language and Notation3 reasoning engine, envisioned as a first step toward Linked Open Model Production Data. We have also encoded our proposed model licenses into rules and demonstrated the effects of GPL and other commonly used licenses in model publishing, along with the flexibility advantages of our licenses, through comparisons and experiments.
翻译:随着模型参数量级达到数十亿规模且其训练消耗泽它FLOPs级别的计算量,组件复用与协作开发在机器学习(ML)社区中日益普遍。这些组件(包括模型、软件和数据集)可能源自不同发布方,并遵循多种授权协议,这些协议规定了授权作品及其衍生品的使用与分发规则。然而,当前普遍采用的授权协议(如GPL和Apache)主要针对软件领域,在模型发布场景中缺乏明确定义与边界。同时,复用组件可能同时涉及自由内容授权与模型授权,这在模型生产工作流中可能引发授权违规与权利侵权的潜在风险。本文通过两条路径应对上述挑战:1)在授权分析方面,我们开发了面向ML工作流管理的新术语体系,并通过编码授权规则实现本体推理,以分析权利授予与合规性问题;2)在标准化模型发布方面,我们起草了一套模型授权协议,为模型发布提供满足多样化需求的灵活选择。我们的分析工具基于Turtle语言与Notation3推理引擎构建,旨在成为关联开放模型生产数据体系的基础设施。我们还将提出的模型授权协议编码为规则,通过对比实验展示了GPL等常用协议在模型发布场景中的实际影响,以及我们设计的授权协议在灵活性方面的优势。