GPUs have become the de facto hardware for accelerating Deep Neural Network (DNN) inference in deep learning (DL) frameworks. However, the conventional sequential execution of DNN operators in mainstream DL frameworks cannot fully utilize GPU resources, because DNN model structures are growing more complex while individual DNN operators are becoming computationally smaller. Moreover, an ill-chosen operator launch order in parallelized execution can waste GPU resources and cause unexpected performance interference among operators. To address these performance issues, we propose Opara, a resource- and interference-aware DNN operator parallel scheduling framework that accelerates DNN inference on GPUs. Specifically, Opara first employs CUDA Streams and CUDA Graph to automatically parallelize the execution of multiple DNN operators. It then uses the resource demands of DNN operators to judiciously adjust the operator launch order on GPUs, overlapping the execution of compute-intensive and memory-intensive operators to expedite DNN inference. We implement and open-source a prototype of Opara based on PyTorch in a non-intrusive manner. Extensive prototype experiments with representative DNN and Transformer-based models demonstrate that Opara outperforms the default sequential CUDA Graph in PyTorch and the state-of-the-art DNN operator parallelism systems by up to 1.68$\times$ and 1.29$\times$, respectively, with acceptable runtime overhead.
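The idea of adjusting the launch order so that compute-intensive and memory-intensive operators overlap can be illustrated with a small, GPU-free sketch. This is not Opara's actual scheduling algorithm, which the abstract does not spell out; it is a simplified greedy heuristic written for illustration only. The function name `interleaved_launch_order`, the operator names, and the two-way `"compute"`/`"memory"` labels are all assumptions introduced here. The sketch walks an operator dependency graph in topological order and, among the operators that are ready to launch, prefers one whose kind differs from the previously launched operator.

```python
def interleaved_launch_order(deps, kind):
    """Return a topological launch order that alternates operator kinds when possible.

    deps: dict mapping each operator name to the list of operators it depends on.
    kind: dict mapping each operator name to "compute" or "memory" (hypothetical labels).
    """
    indegree = {op: len(parents) for op, parents in deps.items()}
    children = {op: [] for op in deps}
    for op, parents in deps.items():
        for parent in parents:
            children[parent].append(op)

    # Operators with no unmet dependencies can be launched immediately.
    ready = [op for op in deps if indegree[op] == 0]
    order = []
    last_kind = None
    while ready:
        # Prefer a ready operator whose kind differs from the previous launch;
        # fall back to the first ready operator if none differs.
        pick = next((op for op in ready if kind[op] != last_kind), ready[0])
        ready.remove(pick)
        order.append(pick)
        last_kind = kind[pick]
        for child in children[pick]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return order


# Hypothetical block with three independent branches feeding one concat.
deps = {"conv_a": [], "conv_b": [], "pool_c": [], "concat": ["conv_a", "conv_b", "pool_c"]}
kind = {"conv_a": "compute", "conv_b": "compute", "pool_c": "memory", "concat": "memory"}
order = interleaved_launch_order(deps, kind)
print(order)
```

In this toy graph the memory-bound `pool_c` is launched between the two convolutions instead of after both. A real scheduler would presumably rely on profiled resource demands rather than a binary label, and would issue the operators on separate CUDA streams so that adjacent launches can actually run concurrently.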