This work is motivated by recent developments in Deep Neural Networks, particularly the Transformer architectures underlying applications such as ChatGPT, and the need to perform inference on mobile devices. Focusing on emerging Transformers (specifically those with computationally efficient Swin-like architectures) and large Transformer-based models (e.g., Stable Diffusion and LLMs), we observe that layout transformations between computational operators cause a significant slowdown in these applications. This paper presents SmartMem, a comprehensive framework for eliminating most layout transformations, based on the idea that multiple operators can use the same tensor layout through careful choice of layout and implementation of operations. Our approach classifies operators into four groups and considers combinations of producer-consumer edges between operators, and we develop a set of methods for searching for such layouts. Another component of our work is developing efficient memory layouts for the 2.5-dimensional memory commonly seen in mobile devices. Our experimental results show that SmartMem outperforms 5 state-of-the-art DNN execution frameworks on mobile devices across 18 varied neural networks, including CNNs, Transformers with both local and global attention, and LLMs. In particular, compared with DNNFusion, SmartMem achieves an average speedup of 2.8$\times$, and it outperforms TVM and MNN with average speedups of 6.9$\times$ and 7.9$\times$, respectively.