Machine learning training data is often dynamic in real-world use cases, i.e., data is added or removed and may experience distribution shifts over time. Models must incorporate this evolving training data to improve generalization, adapt to potential distribution shifts, and adhere to privacy regulations. However, the cost of model (re)training is proportional to how often the model trains and how much data it trains on. While ML research explores these topics in isolation, there is no end-to-end open-source platform that facilitates the exploration of model retraining and data selection policies and the efficient deployment of these algorithms at scale. We present Modyn, a platform for model training on dynamic datasets that enables sample-level data selection and triggering policies. Modyn orchestrates continuous training pipelines while optimizing the underlying system infrastructure to support fast access to arbitrary data samples for efficient data selection. Modyn's extensible architecture allows users to run training pipelines without modifying the platform code, and enables researchers to effortlessly extend the system. We evaluate Modyn's training throughput, showing that even in memory-bound recommendation systems workloads, Modyn reaches 80 to 100% of the throughput achieved by loading big chunks of data locally without sample-level data selection. Additionally, we showcase Modyn's functionality with three different data selection policies.
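To make the two policy types named in the abstract concrete, the following is a minimal sketch of how a triggering policy and a sample-level data selection policy could interact in a continuous training loop. It is not Modyn's API: `AmountTrigger`, `select_uniform`, and `run_pipeline` are hypothetical names introduced here for illustration, and the policies shown (retrain after a fixed number of new samples, keep a uniform random fraction) are assumed examples rather than the policies evaluated in the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class AmountTrigger:
    """Hypothetical triggering policy: fire once `every_n` new samples have arrived."""
    every_n: int
    _seen: int = 0

    def observe(self, num_new_samples: int) -> bool:
        # Count newly arrived samples and reset the counter whenever we fire.
        self._seen += num_new_samples
        if self._seen >= self.every_n:
            self._seen = 0
            return True
        return False


def select_uniform(sample_ids, fraction, seed=0):
    """Hypothetical selection policy: keep a uniform random fraction of sample ids."""
    rng = random.Random(seed)
    k = max(1, int(len(sample_ids) * fraction))
    return sorted(rng.sample(sample_ids, k))


def run_pipeline(batches, trigger, fraction):
    """Consume arriving batches of sample ids and return the ids chosen at each trigger."""
    pool, trainings = [], []
    for batch in batches:
        pool.extend(batch)
        if trigger.observe(len(batch)):
            # A real system would launch a training job on the selected samples here.
            trainings.append(select_uniform(pool, fraction))
    return trainings


if __name__ == "__main__":
    batches = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]
    for ids in run_pipeline(batches, AmountTrigger(every_n=8), fraction=0.5):
        print(ids)
```

In this sketch the trigger decides when to train and the selection function decides on which samples, so the two cost factors mentioned above (training frequency and training set size) can be varied independently.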