VeML: An End-to-End Machine Learning Lifecycle for Large-scale and High-dimensional Data

From arXiv. The updated version of this paper, titled "Efficient ML Lifecycle Transferring for Large-scale and High-dimensional Data via Core Set-based Dataset Similarity," has been accepted for publication in IEEE Access.

An end-to-end machine learning (ML) lifecycle consists of many iterative processes, from data preparation and ML model design to model training and deployment of the trained model for inference. Building an end-to-end lifecycle for an ML problem requires designing and executing many ML pipelines, which produces a huge number of lifecycle versions. This paper therefore introduces VeML, a Version management system dedicated to the end-to-end ML Lifecycle. Our system tackles several crucial problems that other systems have not solved. First, we address the high cost of building an ML lifecycle, especially for large-scale and high-dimensional datasets. We solve this problem by transferring the lifecycle of similar datasets managed in our system to the new training data. We design a core set-based algorithm to efficiently compute similarity for large-scale, high-dimensional data. Another critical issue is the degradation of model accuracy caused by differences between training data and testing data during the ML lifetime, which forces a lifecycle rebuild. Our system helps detect this mismatch without requiring labels for the testing data and rebuilds the ML lifecycle for the new data version. To demonstrate our contributions, we conduct experiments on real-world, large-scale datasets of driving images and spatiotemporal sensor data and show promising results.
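The abstract does not spell out the similarity algorithm, so the following is only an illustrative sketch of the general idea of comparing datasets through small core sets rather than through all of their points. It assumes each dataset is already available as a NumPy array of feature vectors, selects a core set with greedy k-center (farthest-point) selection, and scores two datasets by a symmetric mean nearest-neighbour distance between their core sets. The function names `k_center_greedy` and `core_set_distance`, the parameter `k`, and the choice of distance are all assumptions made for illustration, not VeML's actual method.

```python
import numpy as np


def k_center_greedy(features: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Return indices of up to k points chosen by greedy k-center selection.

    Hypothetical helper: starts from a random point, then repeatedly adds
    the point that is farthest from all points selected so far.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    k = min(k, n)
    selected = [int(rng.integers(n))]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)


def core_set_distance(a: np.ndarray, b: np.ndarray, k: int = 32) -> float:
    """Symmetric mean nearest-neighbour distance between two core sets.

    Smaller values mean the datasets look more alike. This metric is an
    assumption for the sketch, not the measure used in the paper.
    """
    core_a = a[k_center_greedy(a, k)]
    core_b = b[k_center_greedy(b, k)]
    # Pairwise Euclidean distances between core sets, shape (k_a, k_b).
    pairwise = np.linalg.norm(core_a[:, None, :] - core_b[None, :, :], axis=2)
    return float(0.5 * (pairwise.min(axis=1).mean() + pairwise.min(axis=0).mean()))
```

Because only `k` points per dataset enter the pairwise comparison, the cost of comparing two datasets depends on `k` rather than on the full dataset sizes. Under the same assumptions, the score could also be computed between training features and unlabeled testing features as a rough, label-free signal of a train/test mismatch.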
