There is a growing demand to deploy computation-intensive deep learning (DL) models on resource-constrained mobile devices for real-time intelligent applications. Equipped with a variety of processing units such as CPUs, GPUs, and NPUs, mobile devices hold the potential to accelerate DL inference via parallel execution across heterogeneous processors. Various efficient parallel methods have been explored to optimize computation distribution, achieve load balance, and minimize communication cost across processors. Yet their practical effectiveness in dynamic and diverse real-world mobile environments remains underexplored. This paper presents a holistic empirical study to assess the capabilities and challenges associated with parallel DL inference on heterogeneous mobile processors. Through carefully designed experiments covering various DL models, mobile software/hardware environments, workload patterns, and resource availability, we identify the limitations of existing techniques and highlight opportunities for cross-level optimization.