As model parameter counts scale into the billions and training consumes zettaFLOPs of computation, the reuse of Machine Learning (ML) assets and collaborative development have become increasingly prevalent in the ML community. These ML assets, including models, datasets, and software, may originate from various sources and be published under different licenses, which govern the use and distribution of licensed works and their derivatives. However, commonly chosen licenses, such as GPL and Apache, are software-specific and are not clearly defined or bounded in the context of model publishing. Meanwhile, reused assets may also be under free-content licenses and model licenses, which poses a potential risk of license noncompliance and rights infringement within the model production workflow. In this paper, we address these challenges along two lines: 1) For ML workflow compliance, we propose ModelGo (MG) Analyzer, a tool that incorporates a vocabulary for ML workflow management and encoded license rules, enabling ontological reasoning to analyze rights granting and compliance issues. 2) For standardized model publishing, we introduce ModelGo Licenses, a set of model-specific licenses that provide flexible options to meet the diverse needs of the ML community. MG Analyzer is built on the Turtle language and a Notation3 reasoning engine, and is envisioned as a first step toward Linked Open Data for ML workflow management. We have also encoded our proposed model licenses into rules and, through comparisons and experiments, demonstrated the effects of GPL and other commonly used licenses in model publishing, along with the flexibility advantages of our licenses.
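To make the idea of rule-based workflow compliance concrete, the following is a minimal illustrative sketch in plain Python. It is not MG Analyzer, which according to the abstract relies on Turtle and Notation3 reasoning; the `Asset` class, the `LICENSE_TERMS` table, and the two rules are hypothetical simplifications introduced here. In particular, treating a fine-tuned model as a derivative that inherits copyleft obligations is an assumption, and it is exactly the kind of question that software licenses leave unclear for models.

```python
from dataclasses import dataclass, field

# Hypothetical, heavily simplified license terms for illustration only.
# These are NOT the rules encoded in MG Analyzer and are not legal advice.
LICENSE_TERMS = {
    "Apache-2.0": {"copyleft": False, "commercial_use": True},
    "GPL-3.0": {"copyleft": True, "commercial_use": True},
    "CC-BY-NC-4.0": {"copyleft": False, "commercial_use": False},
}


@dataclass
class Asset:
    """A node in an ML workflow: a dataset, model, or piece of software."""
    name: str
    license: str
    derived_from: list = field(default_factory=list)


def upstream(asset, seen=None):
    """Return every asset that `asset` transitively derives from, keyed by name."""
    seen = {} if seen is None else seen
    for parent in asset.derived_from:
        if parent.name not in seen:
            seen[parent.name] = parent
            upstream(parent, seen)
    return seen


def check_compliance(asset, commercial):
    """Apply two toy rules to each upstream asset and collect the violations."""
    issues = []
    for src in upstream(asset).values():
        terms = LICENSE_TERMS[src.license]
        # Assumed rule 1: a copyleft source forces the same license downstream.
        if terms["copyleft"] and asset.license != src.license:
            issues.append(
                f"{asset.name} is published under {asset.license}, "
                f"but {src.name} is copyleft ({src.license})"
            )
        # Assumed rule 2: non-commercial sources forbid commercial use downstream.
        if commercial and not terms["commercial_use"]:
            issues.append(
                f"{asset.name} is used commercially, "
                f"but {src.name} ({src.license}) forbids commercial use"
            )
    return issues


# Usage: a model fine-tuned from a GPL base model on a non-commercial corpus.
corpus = Asset("corpus", "CC-BY-NC-4.0")
base = Asset("base-model", "GPL-3.0")
tuned = Asset("tuned-model", "Apache-2.0", [base, corpus])
for issue in check_compliance(tuned, commercial=True):
    print(issue)
```

A reasoning engine over a shared vocabulary generalizes this pattern: the workflow graph and the license rules become declarative data, so new licenses can be added without changing the checking code.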