tf.data service: A Case for Disaggregating ML Input Data Processing

Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest it and perform ML computations on it. The host CPU and RAM required per accelerator core to process input data without data stalls vary across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio under-utilizes either the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can scale out horizontally to right-size CPU/RAM host resources for data processing in each job, reducing training time by 32x and cost by 26x, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
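To make the straggler argument behind coordinated reads concrete, the following is a minimal sketch in plain Python, not the tf.data service implementation. It assumes a toy cost model in which each batch has an integer processing cost set by its input-size bucket, and a synchronous training step lasts as long as its slowest worker. The function names, bucket costs, and worker count are all hypothetical and chosen only for illustration.

```python
import random


def step_times(batches, num_workers):
    """Synchronous training: each step lasts as long as its slowest worker."""
    return [
        max(batches[i:i + num_workers])
        for i in range(0, len(batches), num_workers)
    ]


def uncoordinated_time(batches, num_workers):
    """Workers take whatever batch comes next, so sizes mix within a step."""
    return sum(step_times(batches, num_workers))


def coordinated_time(batches, num_workers):
    """Hand every worker in a step a batch from the same size bucket."""
    buckets = {}
    for cost in batches:
        buckets.setdefault(cost, []).append(cost)
    # Concatenate the buckets so each step draws from a single bucket.
    ordered = [cost for group in buckets.values() for cost in group]
    return sum(step_times(ordered, num_workers))


# Hypothetical workload: 4 workers, 4 input-size buckets, 25 steps per bucket.
NUM_WORKERS = 4
BUCKET_COSTS = [1, 2, 4, 8]
STEPS_PER_BUCKET = 25

rng = random.Random(0)
batches = [
    cost
    for cost in BUCKET_COSTS
    for _ in range(NUM_WORKERS * STEPS_PER_BUCKET)
]
rng.shuffle(batches)

t_uncoordinated = uncoordinated_time(batches, NUM_WORKERS)
t_coordinated = coordinated_time(batches, NUM_WORKERS)
print("uncoordinated:", t_uncoordinated)
print("coordinated:  ", t_coordinated)
```

In this model the coordinated schedule can never be slower: when every worker in a step receives a batch of the same cost, the step takes exactly the average cost of its batches, whereas a mixed step takes the maximum. The size of the gap depends entirely on the assumed cost distribution, so the numbers printed here say nothing about the 2.2x figure reported above.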
