An end-to-end machine learning (ML) lifecycle consists of many iterative processes, from data preparation and ML model design to model training and deployment of the trained model for inference. Building an end-to-end lifecycle for an ML problem requires designing and executing many ML pipelines, which produces a huge number of lifecycle versions. This paper therefore introduces VeML, a Version management system dedicated to the end-to-end ML Lifecycle. Our system tackles several crucial problems that other systems have not solved. First, we address the high cost of building an ML lifecycle, especially for large-scale, high-dimensional datasets. We propose transferring the lifecycle of similar datasets already managed in our system to the new training data, and we design a core-set-based algorithm that efficiently computes similarity between large-scale, high-dimensional datasets. Another critical issue is the degradation of model accuracy caused by differences between training data and testing data during the ML lifetime, which forces the lifecycle to be rebuilt. Our system helps detect this mismatch without requiring labels for the testing data and rebuild the ML lifecycle for the new data version. To demonstrate our contributions, we conduct experiments on real-world, large-scale datasets of driving images and spatiotemporal sensor data and show promising results.
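The abstract does not spell out the core-set similarity algorithm, so the following is only a minimal sketch of the general idea, not VeML's method. It assumes that each dataset is a non-empty list of numeric feature vectors, that a core set is chosen by farthest-point (k-center greedy) selection, and that dissimilarity is the symmetric mean nearest-neighbour distance between the two core sets. The names `greedy_k_center` and `coreset_distance` and the default `k=8` are hypothetical.

```python
import math
import random


def greedy_k_center(points, k, seed=0):
    """Pick up to k representatives by farthest-point (k-center greedy) selection."""
    if not points:
        return []
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    chosen = [first]
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(chosen) < min(k, len(points)):
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return [points[i] for i in chosen]


def coreset_distance(a, b, k=8):
    """Symmetric mean nearest-neighbour distance between the two core sets.

    Assumption for illustration only: smaller values mean more similar datasets.
    """
    ca, cb = greedy_k_center(a, k), greedy_k_center(b, k)

    def one_way(src, dst):
        # Average distance from each point in src to its closest point in dst.
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (one_way(ca, cb) + one_way(cb, ca))


if __name__ == "__main__":
    rng = random.Random(1)
    base = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    shifted = [(x + 100.0, y) for x, y in base]
    print(coreset_distance(base, base), coreset_distance(base, shifted))
```

Comparing core sets instead of full datasets keeps the pairwise cost at O(k²) distance evaluations after selection, which is the usual motivation for a core-set approach on large, high-dimensional data. A real system would probably compute distances in a learned feature space instead of on raw inputs.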