Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. At the same time, the unified memory architecture of mobile devices and their narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles to such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to the CPU and GPU (due to the dynamic selection of implementations and parallelism levels). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models that accurately predict execution times. Notably, these models capture the performance characteristics of GPU kernels and account for their dispatch times. A comprehensive evaluation on four mobile platforms shows that our approach quickly selects CPU-GPU co-execution strategies achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers (close to the achievable maximum values of 2.01x and 1.87x, respectively, found by exhaustive grid search on a Pixel~5 smartphone).
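The core scheduling decision described above can be illustrated with a minimal sketch: given predicted execution times for each device, choose the work split that minimizes the co-execution makespan, i.e. the finish time of the slower side. The cost models below (`predicted_cpu_time`, `predicted_gpu_time`) and their coefficients are hypothetical stand-ins, not the paper's learned ML predictors; the fixed term in the GPU model only mimics the kernel dispatch overhead the paper's models account for.

```python
# Hypothetical sketch of the co-execution split decision (not the paper's
# implementation). A layer's work is divided into `total_rows` units; the CPU
# takes `cpu_rows` and the GPU takes the rest, running in parallel.

def predicted_cpu_time(rows: int) -> float:
    # Assumed linear cost model: per-unit compute cost on the CPU.
    return 0.08 * rows

def predicted_gpu_time(rows: int) -> float:
    # Assumed model: fixed dispatch overhead plus per-unit kernel cost.
    return 1.5 + 0.03 * rows

def best_split(total_rows: int) -> tuple[int, float]:
    """Pick the CPU share minimizing the makespan max(CPU time, GPU time),
    since both devices execute their partitions concurrently."""
    best_cpu_rows, best_makespan = 0, float("inf")
    for cpu_rows in range(total_rows + 1):
        gpu_rows = total_rows - cpu_rows
        makespan = max(predicted_cpu_time(cpu_rows),
                       predicted_gpu_time(gpu_rows))
        if makespan < best_makespan:
            best_cpu_rows, best_makespan = cpu_rows, makespan
    return best_cpu_rows, best_makespan

if __name__ == "__main__":
    cpu_rows, makespan = best_split(256)
    gpu_only = predicted_gpu_time(256)
    print(f"CPU rows: {cpu_rows}, makespan: {makespan:.2f}, "
          f"speedup vs GPU-only: {gpu_only / makespan:.2f}x")
```

Under these toy cost models, the chosen split balances the two devices so that neither idles long waiting for the other; the paper replaces the exhaustive loop's inputs with learned predictors so the search runs quickly at deployment time rather than requiring on-device grid search.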