Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators that take on distinct responsibilities when handling serving requests, i.e., general-purpose CPUs for input preprocessing and domain-specific GPUs for forward computation. Recurrent neural networks play an essential role in handling temporal inputs and display distinctive computation characteristics because of their high inter-operator parallelism. Hence, we propose Chrion to optimize recurrent neural network inference by collaboratively utilizing CPUs and GPUs. We formulate model deployment in the CPU-GPU cluster as an NP-hard problem of scheduling directed acyclic graphs on heterogeneous devices. Given an input model in the ONNX format and a user-defined SLO requirement, Chrion first preprocesses the model by parsing and profiling it, and then partitions the graph to select an execution device for each operator. When an online request arrives, Chrion performs forward computation according to the graph partition by executing the operators on the CPU and GPU in parallel. Our experimental results show that, compared with GPU-only execution, the execution time can be reduced by up to 19.4% in the latency-optimal pattern and the GPU memory footprint by 67.5% in the memory-optimal pattern.
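To make the scheduling formulation concrete, the following is a minimal sketch of placing the operators of a directed acyclic graph on two devices. It is not Chrion's partitioning algorithm, which the abstract does not specify; it is a simple greedy earliest-finish-time heuristic chosen only for illustration. The operator names, the per-device costs, the flat `transfer` cost, and the function names `topo_order` and `greedy_partition` are all assumptions introduced here, not values or APIs from the paper.

```python
from collections import deque


def topo_order(deps):
    """Return a topological order (Kahn's algorithm).

    deps maps each operator to the list of operators it depends on.
    """
    indegree = {op: len(pre) for op, pre in deps.items()}
    users = {op: [] for op in deps}
    for op, pre in deps.items():
        for p in pre:
            users[p].append(op)
    ready = deque(op for op in deps if indegree[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for u in users[op]:
            indegree[u] -= 1
            if indegree[u] == 0:
                ready.append(u)
    if len(order) != len(deps):
        raise ValueError("graph contains a cycle")
    return order


def greedy_partition(deps, cost, transfer=0.0):
    """Greedily place each operator on the device where it finishes earliest.

    cost[op] is a dict of hypothetical profiled run times per device.
    transfer is an assumed flat cost for moving a tensor across devices.
    Returns (placement, makespan).
    """
    device_free = {"cpu": 0.0, "gpu": 0.0}  # time at which each device is idle
    placement, finish = {}, {}
    for op in topo_order(deps):
        best = None
        for dev in ("gpu", "cpu"):  # GPU is tried first, so ties go to the GPU
            # Inputs are ready once every predecessor has finished, plus a
            # transfer cost if the predecessor ran on the other device.
            inputs_ready = max(
                (finish[p] + (transfer if placement[p] != dev else 0.0)
                 for p in deps[op]),
                default=0.0,
            )
            end = max(inputs_ready, device_free[dev]) + cost[op][dev]
            if best is None or end < best[0]:
                best = (end, dev)
        finish[op], placement[op] = best
        device_free[best[1]] = best[0]
    return placement, max(finish.values())


# Hypothetical graph: two independent recurrent branches joined by a concat.
deps = {
    "embed": [],
    "lstm_fwd": ["embed"],
    "lstm_bwd": ["embed"],
    "concat": ["lstm_fwd", "lstm_bwd"],
}
# Made-up profiling numbers (arbitrary time units), not measurements.
cost = {
    "embed": {"cpu": 2.0, "gpu": 1.0},
    "lstm_fwd": {"cpu": 5.0, "gpu": 4.0},
    "lstm_bwd": {"cpu": 5.0, "gpu": 4.0},
    "concat": {"cpu": 1.0, "gpu": 1.0},
}

placement, makespan = greedy_partition(deps, cost, transfer=0.5)
gpu_only = sum(c["gpu"] for c in cost.values())
print(placement, makespan, gpu_only)
```

With these made-up costs, the sketch runs one recurrent branch on the CPU while the other runs on the GPU, giving a makespan of 7.5 time units against 10.0 when every operator runs back to back on the GPU. A greedy heuristic like this carries no optimality guarantee, and it ignores GPU memory and SLO constraints, both of which the abstract names as part of the actual problem.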