Deep Neural Networks (DNNs) are extremely computationally demanding, which presents a large barrier to their deployment on resource-constrained devices. Since many emerging deep learning applications target such devices (e.g., drones, vision-based medical technology), significant bodies of work from both the machine learning and systems communities have sought to accelerate DNNs through optimization. To help unify these two perspectives, in this paper we combine machine learning and systems techniques within the Deep Learning Acceleration Stack (DLAS), and use an across-stack perturbation study to demonstrate how tightly its layers can depend on each other. We evaluate the impact on accuracy and inference time of varying different DLAS parameters across two datasets, seven popular DNN architectures, four DNN compression techniques, three algorithmic primitives with sparse and dense variants, untuned and auto-scheduled code generation, and four hardware platforms. Our evaluation highlights how perturbing DLAS parameters can cause significant variation and across-stack interactions. The highest-level observation from our evaluation is that model size, accuracy, and inference time are not guaranteed to be correlated. Overall, we make 13 key observations, including that the speedups provided by compression techniques are highly hardware-dependent, and that compiler auto-tuning can significantly alter which algorithm is best for a given configuration. With DLAS, we aim to provide a reference framework that helps machine learning and systems practitioners reason about the context in which their respective DNN acceleration solutions exist. Because our evaluation strongly motivates the need for co-design, we believe that DLAS can be a valuable concept for exploring the next generation of co-designed accelerated deep learning solutions.