Deploying large language models (LLMs) on mobile devices is an emerging trend that enables data privacy and offline access for LLM applications. Modern mobile neural processing units (NPUs) make such deployment increasingly feasible. However, existing mobile LLM inference frameworks suffer from high start-up latency caused by unavoidable cold starts, i.e., launching LLM inference when the model is not resident in device memory. In this paper, we identify the key bottleneck of mobile LLM cold starts as flash bandwidth wasted on unimportant model parameters. We design EdgeFlow, a mobile LLM inference framework that mitigates the cold-start problem by adaptively adjusting the precision of LLM parameters. Specifically, EdgeFlow leverages 1) an NPU-aware adaptive quantization algorithm that assigns different precisions to weights at a finer granularity according to their importance and NPU constraints, 2) a SIMD-friendly packing format that accelerates the conversion of mixed-precision weights into fixed-size NPU-native data types, and 3) a synergistic granular pipeline that coordinates CPU and NPU computation in a fine-grained, dynamic manner. Experimental results show that EdgeFlow reduces cold-start latency by up to 4.07x compared with three state-of-the-art mobile LLM inference frameworks (llama.cpp, MNN, and llm.npu) at comparable model accuracy.
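To make the first two ideas concrete, the following is a minimal illustrative sketch and not EdgeFlow's actual algorithm or file format. It assumes a simple greedy, importance-ordered bit-width assignment under a total bit budget, and a plain two-values-per-byte layout for signed 4-bit weights that is expanded to int8-range values by sign extension. The function names (`assign_bits`, `pack_int4`, `unpack_int4_to_int8`), the candidate bit widths, and the importance scores are all hypothetical.

```python
from typing import List, Sequence


def assign_bits(importance: Sequence[float], sizes: Sequence[int],
                budget_bits: int, choices: Sequence[int] = (2, 4, 8)) -> List[int]:
    """Greedy sketch: start every weight group at the lowest precision,
    then upgrade the most important groups first while the budget allows.
    `choices` must be sorted in ascending order (assumed bit widths)."""
    bits = [choices[0]] * len(importance)
    used = sum(size * choices[0] for size in sizes)
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest precision")
    # Visit groups from most to least important.
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    for i in order:
        for b in choices[1:]:
            extra = sizes[i] * (b - bits[i])
            if used + extra > budget_bits:
                break  # cannot afford this upgrade; move to the next group
            used += extra
            bits[i] = b
    return bits


def pack_int4(values: Sequence[int]) -> bytes:
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("expected an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("value out of int4 range")
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)


def unpack_int4_to_int8(packed: bytes) -> List[int]:
    """Expand packed int4 values to int8-range integers via sign extension."""
    out: List[int] = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


if __name__ == "__main__":
    # Three groups of 100 weights each, with a 1400-bit budget.
    print(assign_bits([0.9, 0.1, 0.5], [100, 100, 100], 1400))
    packed = pack_int4([-8, 7, -1, 0])
    print(packed.hex(), unpack_int4_to_int8(packed))
```

In this sketch, less important groups stay at low precision, so fewer bytes need to be read from flash, while the unpacking step restores a uniform fixed-size type for the accelerator. A production implementation would vectorize the unpacking with SIMD instructions rather than looping per byte.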