FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address this challenge, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU), so that both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. It introduces LLM-Aware Memory Management and a Secure Inference Pipeline to accelerate inference, and a Multi-Model Scheduler to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs: a basic strawman and an optimized strawman with pipeline and secure NPU enabled. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) over the basic strawman, and an average $2.44\times$ TTFT speedup over the optimized strawman. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the basic strawman and the optimized strawman, respectively.
