Currently, most machine learning models are trained by centralized teams and are rarely updated. In contrast, open-source software development involves the iterative development of a shared artifact through distributed collaboration using a version control system. In the interest of enabling collaborative and continual improvement of machine learning models, we introduce Git-Theta, a version control system for machine learning models. Git-Theta is an extension to Git, the most widely used version control software, that allows fine-grained tracking of changes to model parameters alongside code and other artifacts. Unlike existing version control systems that treat a model checkpoint as a blob of data, Git-Theta leverages the structure of checkpoints to support communication-efficient updates, automatic model merges, and meaningful reporting about the difference between two versions of a model. In addition, Git-Theta includes a plug-in system that lets users easily add support for new functionality. In this paper, we introduce Git-Theta's design and features and include an example use case in which a pre-trained model is continually adapted and modified. We publicly release Git-Theta in the hope of kickstarting a new era of collaborative model development.
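To make the contrast with blob-level versioning concrete, the following is a minimal sketch of what tracking a checkpoint at the level of individual parameter groups could look like. It is not Git-Theta's implementation or API: the `Checkpoint` type, the function names `diff_checkpoints` and `merge_checkpoints`, and the choice of averaging as the conflict-resolution strategy are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter-group name -> flat list of values.
Checkpoint = Dict[str, List[float]]


def diff_checkpoints(old: Checkpoint, new: Checkpoint) -> Dict[str, List[str]]:
    """Report which parameter groups were added, removed, or modified."""
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "modified": sorted(k for k in new if k in old and new[k] != old[k]),
    }


def merge_checkpoints(base: Checkpoint, ours: Checkpoint, theirs: Checkpoint) -> Checkpoint:
    """Three-way merge per parameter group.

    Simplifying assumptions: a group present on only one side is kept
    (deletions are not modeled), and a group changed on both sides is
    resolved by element-wise averaging.
    """
    merged: Checkpoint = {}
    for name in sorted(set(ours) | set(theirs)):
        a, b, o = ours.get(name), theirs.get(name), base.get(name)
        if a is None:
            merged[name] = b          # only "theirs" has this group
        elif b is None:
            merged[name] = a          # only "ours" has this group
        elif a == b or b == o:
            merged[name] = a          # identical, or only "ours" changed
        elif a == o:
            merged[name] = b          # only "theirs" changed
        else:
            # Both sides changed the same group: average (assumed strategy).
            merged[name] = [(x + y) / 2 for x, y in zip(a, b)]
    return merged


base = {"w": [1.0, 2.0], "b": [0.0]}
ours = {"w": [1.0, 4.0], "b": [0.0]}
theirs = {"w": [3.0, 2.0], "b": [0.0], "head": [5.0]}

report = diff_checkpoints(base, theirs)
merged = merge_checkpoints(base, ours, theirs)
```

Because only the groups listed under `added` and `modified` differ from the previous version, only those groups would need to be stored or transmitted, which is the intuition behind communication-efficient updates.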