While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of batch and streaming that enables efficient and fault-tolerant heterogeneous execution. The key idea is to use partitions as the unit of execution to achieve elasticity, but to allow partitions to be dynamically created and streamed between heterogeneous operators for memory-efficient pipelining. We present Ray Data, a streaming batch system that improves throughput on heterogeneous batch inference pipelines by 2.5-12$\times$ compared to traditional batch and stream processing systems. By leveraging heterogeneous clusters, Ray Data improves training throughput for multimodal models such as Stable Diffusion by 31% compared to single-node ML data loaders.
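The key idea above can be sketched with plain Python generators: partitions are the unit of execution, dynamically created by an upstream operator and streamed one at a time into downstream heterogeneous operators, so the full dataset is never materialized. This is a minimal illustration of the streaming batch idea, not Ray Data's actual API; the operator names and partition sizes are invented for the example.

```python
# Minimal sketch (assumed names, not Ray Data's API): partitions stream
# between heterogeneous operators, so at most one partition per operator
# is live at any moment -- the memory-efficient pipelining the abstract
# describes.

from typing import Iterator, List

def load_partitions(n_rows: int, partition_size: int) -> Iterator[List[int]]:
    """Upstream operator: dynamically create partitions of the input."""
    for start in range(0, n_rows, partition_size):
        yield list(range(start, min(start + partition_size, n_rows)))

def cpu_preprocess(parts: Iterator[List[int]]) -> Iterator[List[int]]:
    """CPU-heavy operator: transform each partition as it arrives."""
    for part in parts:
        yield [x * 2 for x in part]

def gpu_infer(parts: Iterator[List[int]]) -> Iterator[int]:
    """GPU-style operator: consume partitions in streaming fashion."""
    for part in parts:
        yield sum(part)  # stand-in for a batched model forward pass

# Partitions flow through the pipeline one at a time; the dataset of
# 10 rows is never held in memory as a whole.
results = list(gpu_infer(cpu_preprocess(load_partitions(10, 4))))
print(results)  # -> [12, 44, 34]
```

In a real streaming batch system the operators would additionally run on different resource types (CPU vs. GPU workers) with backpressure between them; the generator chain here only captures the partition-at-a-time dataflow.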