End users face a trade-off between privacy and efficiency in current Large Language Model (LLM) service paradigms. In cloud-based paradigms, users must give up data locality in exchange for generation quality and processing speed. Conversely, edge-device paradigms maintain data locality but fail to deliver satisfactory performance. In this work, we propose a novel LLM service paradigm that places privacy-sensitive computation on edge devices and shared computation in the cloud. Only activations are transmitted between the central cloud and edge devices, which preserves data locality. Our core innovation, PrivateLoRA, addresses the resulting communication overhead by exploiting the low rank of residual activations, reducing communication by over 95%. Consequently, PrivateLoRA maintains data locality while remaining highly resource efficient. Under standard 5G networks, PrivateLoRA achieves more than 300% of the throughput of device-only solutions for 7B models and more than 80% of the throughput of an A100 GPU for 33B models. PrivateLoRA also provides tuning performance comparable to LoRA for advanced personalization. Our approach democratizes access to state-of-the-art generative AI for edge devices, paving the way for more tailored LLM experiences for the general public. To our knowledge, our proposed framework is the first efficient and privacy-preserving LLM solution in the literature.
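As a rough illustration of why transmitting low-rank activations can cut communication so sharply, the sketch below compares the per-token payload of a full hidden-state vector with that of a rank-sized vector. The hidden size, the rank, the 16-bit precision, and the function names are hypothetical values chosen for illustration; they are not figures or identifiers from this work.

```python
def activation_bytes(dim: int, bytes_per_value: int = 2) -> int:
    """Payload size in bytes for one token's activation vector of length `dim`.

    bytes_per_value=2 assumes 16-bit floats (an assumption, not from the source).
    """
    return dim * bytes_per_value


def communication_reduction(hidden_dim: int, rank: int) -> float:
    """Fraction of per-token traffic saved by sending `rank` values
    instead of `hidden_dim` values."""
    full = activation_bytes(hidden_dim)
    low_rank = activation_bytes(rank)
    return 1.0 - low_rank / full


# Hypothetical sizes, chosen only to show the arithmetic.
HIDDEN_DIM = 4096
RANK = 128

reduction = communication_reduction(HIDDEN_DIM, RANK)
print(f"full payload:     {activation_bytes(HIDDEN_DIM)} bytes per token")
print(f"low-rank payload: {activation_bytes(RANK)} bytes per token")
print(f"reduction:        {reduction:.1%}")
```

Under these assumed sizes, the saving depends only on the ratio of rank to hidden size, so any rank below 5% of the hidden dimension would give a reduction above 95%.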