Large Language Models (LLMs) are widely employed on mobile phones for tasks such as intelligent assistants, text summarization, translation, and multi-modal interaction. However, current methods for on-device LLM deployment suffer from slow inference speed, which degrades the user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic expression-based approach to support dynamic-shape model inference; (b) operator optimizations and execution priority setting to enhance inference speed and reduce phone lag; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; (d) a sub-tensor-based technique to eliminate the need for copying the KV cache after LLM inference. Furthermore, we implement these methods in our mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK processors. We evaluated Transformer-Lite's performance using LLMs with varied architectures and parameter counts ranging from 2B to 14B. Specifically, we achieved prefill and decoding speeds of 121 token/s and 14 token/s for ChatGLM2 6B, and 330 token/s and 30 token/s for the smaller Gemma 2B, respectively. Compared with CPU-based FastLLM and GPU-based MLC-LLM, our engine attains over 10x speedup in prefill speed and 2-3x speedup in decoding speed.
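The abstract does not detail the sub-tensor KV-cache mechanism, so the following is only a minimal sketch of one plausible reading: pre-allocate a single KV buffer at the maximum sequence length and hand the attention operator per-step views (sub-tensors) into it, so new entries are written in place and no post-inference copy of the cache is required. All names, shapes, and the `decode_step` helper below are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

# Hypothetical shapes for illustration only.
MAX_SEQ_LEN, N_HEADS, HEAD_DIM = 2048, 32, 128

# One pre-allocated buffer per layer; it is never reallocated or copied.
k_cache = np.zeros((MAX_SEQ_LEN, N_HEADS, HEAD_DIM), dtype=np.float16)
v_cache = np.zeros_like(k_cache)

def decode_step(pos, new_k, new_v):
    """Write this step's K/V into sub-tensor views of the shared buffer."""
    k_cache[pos] = new_k          # in-place write; nothing to copy afterwards
    v_cache[pos] = new_v
    # Attention then reads the valid prefix as another sub-tensor view:
    k_valid = k_cache[: pos + 1]  # a NumPy slice is a view, not a copy
    v_valid = v_cache[: pos + 1]
    return k_valid, v_valid
```

Under this reading, the benefit is that the cache update is a pure pointer-arithmetic operation on the shared buffer, so the per-token cost is independent of the accumulated sequence length.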