As large language models (LLMs) increasingly integrate into our work and daily lives, growing concerns about user privacy are pushing these models toward local deployment. A number of lightweight LLMs (e.g., Gemini Nano, LLAMA2 7B) can now run locally on smartphones, giving users greater control over their personal data. Since on-device LLM inference is a rapidly emerging application, its performance on commercial off-the-shelf mobile devices deserves close attention. To fully understand the current landscape of LLM deployment on mobile platforms, we conduct a comprehensive measurement study on mobile devices. We evaluate both metrics that affect user experience, including token throughput, latency, and battery consumption, and factors critical to developers, such as resource utilization, DVFS strategies, and inference engines. In addition, we provide a detailed analysis of how hardware capabilities and system dynamics affect on-device LLM performance, which may help developers identify and address bottlenecks in mobile LLM applications. We also provide comprehensive comparisons across mobile system-on-chips (SoCs) from major vendors, highlighting their performance differences in handling LLM workloads. We hope this study can provide insights for both the development of on-device LLMs and the design of future mobile system architectures.
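The user-experience metrics named above, token throughput and latency, are commonly derived from per-token timestamps recorded during inference: time-to-first-token captures prefill latency, while the spacing of subsequent tokens gives decode throughput. A minimal sketch of this bookkeeping (function and variable names are hypothetical, not from the paper):

```python
def throughput_metrics(token_timestamps, prompt_submitted_at):
    """Compute latency and throughput from per-token arrival times.

    token_timestamps: monotonic times (seconds) at which each output
    token arrived, in order. prompt_submitted_at: time the prompt was
    submitted to the inference engine.
    """
    # Prefill latency: delay until the first output token appears.
    ttft = token_timestamps[0] - prompt_submitted_at

    # Decode throughput: tokens generated per second after the first.
    n_decoded = len(token_timestamps) - 1
    decode_time = token_timestamps[-1] - token_timestamps[0]
    decode_tps = n_decoded / decode_time if decode_time > 0 else 0.0
    return ttft, decode_tps

# Example: first token after 0.5 s, then four more at 0.1 s intervals.
ttft, tps = throughput_metrics([0.5, 0.6, 0.7, 0.8, 0.9],
                               prompt_submitted_at=0.0)
```

Reporting prefill and decode phases separately matters on mobile SoCs, since prefill is typically compute-bound while decoding is memory-bandwidth-bound, so the two stress different parts of the chip.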