HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. Training a DNN model generally involves large-scale input data with many sparse features, which incurs high input/output (IO) cost, while some layers are compute-intensive. Training typically exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs and multiple types of GPUs, are available for distributed training. Thus, scheduling the layers of a model onto diverse computing resources is critical to the training process. To efficiently train a DNN model using heterogeneous computing resources, we propose a distributed framework, Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. Paddle-HeterPS has three advantages over existing frameworks. First, Paddle-HeterPS enables efficient training of diverse workloads on heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to schedule the workload of each layer to appropriate computing resources, minimizing the cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The code of the framework is publicly available at: https://github.com/PaddlePaddle/Paddle.
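
To make the scheduling problem concrete, the following is a minimal sketch of assigning each layer to a resource type so that the monetary cost is minimized while a target throughput is met. It is not the Paddle-HeterPS scheduler: the RL-based method is replaced here by plain exhaustive search, which is only feasible for a handful of layers. The layer names, prices, per-unit throughputs, and function names are all hypothetical values chosen for illustration and do not come from the paper or from the Paddle API.

```python
import itertools
import math

# Hypothetical hourly price of one unit of each resource type (illustrative only).
PRICE_PER_UNIT_HOUR = {"cpu": 0.04, "gpu": 2.42}

# Hypothetical samples/second that ONE unit of a resource sustains on each layer.
# The sparse, IO-heavy embedding layer gains little from a GPU; dense layers gain a lot.
THROUGHPUT_PER_UNIT = {
    "embedding": {"cpu": 500.0, "gpu": 2000.0},
    "fc1": {"cpu": 100.0, "gpu": 20000.0},
    "fc2": {"cpu": 100.0, "gpu": 20000.0},
}


def units_needed(layer, resource, target):
    """Smallest number of units that lets `layer` reach `target` samples/second."""
    return math.ceil(target / THROUGHPUT_PER_UNIT[layer][resource])


def plan_cost(plan, target):
    """Hourly cost of a layer -> resource plan provisioned to meet the target."""
    return sum(
        units_needed(layer, resource, target) * PRICE_PER_UNIT_HOUR[resource]
        for layer, resource in plan.items()
    )


def cheapest_plan(target):
    """Exhaustive search over all layer -> resource assignments.

    Stand-in for the RL-based scheduler: the number of candidate plans grows
    exponentially with the number of layers, so this only works for tiny models.
    """
    layers = list(THROUGHPUT_PER_UNIT)
    best_plan, best_cost = None, math.inf
    for choice in itertools.product(PRICE_PER_UNIT_HOUR, repeat=len(layers)):
        plan = dict(zip(layers, choice))
        cost = plan_cost(plan, target)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


if __name__ == "__main__":
    plan, cost = cheapest_plan(target=10_000)
    print(plan, round(cost, 2))
```

With these made-up numbers, the search places the embedding layer on CPUs and the two dense layers on GPUs, which mirrors the intuition in the abstract that IO-heavy and compute-intensive layers favor different resources. The throughput constraint is satisfied by construction, because each layer is provisioned with enough units to reach the target.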
