Large Language Models (LLMs) have achieved remarkable performance and attracted significant research interest. Their enormous computational demands, however, hinder local deployment on resource-constrained devices. The prevailing LLM inference paradigm requires users to send queries to service providers for processing, which raises critical privacy concerns. Existing approaches allow users to obfuscate token embeddings before transmission and rely on local models for denoising. Nonetheless, transmitting token embeddings and deploying local models can incur excessive communication and computation overhead, preventing practical adoption. In this work, we propose \textbf{DEL}, a framework for \textbf{D}ifferentially private and communication \textbf{E}fficient \textbf{L}LM split inference. Specifically, we propose an embedding projection module and a differentially private stochastic quantization mechanism that reduce the communication overhead in a privacy-preserving manner. To eliminate the need for local models, we adopt soft prompts at the server side to compensate for the utility degradation caused by privacy protection. To the best of our knowledge, this is the first work that utilizes soft prompts to improve the privacy-utility trade-off in LLM inference. Extensive experiments on text generation and natural language understanding benchmarks demonstrate the effectiveness of the proposed method.
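To make the idea of differentially private stochastic quantization concrete, the following is a minimal, generic sketch (not the paper's exact mechanism): each embedding coordinate is clipped, perturbed with Laplace noise calibrated to the clipping range, and then stochastically rounded to a small number of levels so that only a low-bit index per dimension needs to be transmitted. All parameter names and the sensitivity bookkeeping here are illustrative assumptions.

```python
import numpy as np

def dp_stochastic_quantize(x, clip=1.0, levels=16, epsilon=1.0, rng=None):
    """Illustrative sketch of a DP stochastic quantizer (not DEL itself).

    Steps: (1) clip each coordinate to [-clip, clip]; (2) add Laplace noise
    with scale = sensitivity / epsilon, where the per-coordinate sensitivity
    is 2 * clip under replacement of one input value; (3) stochastically
    (unbiasedly) round to a uniform grid of `levels` points, so only
    log2(levels) bits per dimension must be sent.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(x, dtype=float), -clip, clip)
    # Laplace mechanism for epsilon-DP per coordinate (assumed accounting).
    noisy = x + rng.laplace(scale=2.0 * clip / epsilon, size=x.shape)
    noisy = np.clip(noisy, -clip, clip)
    # Map to continuous grid positions in [0, levels - 1].
    pos = (noisy + clip) / (2.0 * clip) * (levels - 1)
    lower = np.floor(pos)
    frac = pos - lower
    # Unbiased stochastic rounding: round up with probability `frac`.
    idx = lower + (rng.random(x.shape) < frac)
    # Dequantize the transmitted indices back to [-clip, clip].
    return idx / (levels - 1) * 2.0 * clip - clip

x = np.random.randn(8)
q = dp_stochastic_quantize(x, clip=1.0, levels=16, epsilon=1.0)
```

In this sketch the privacy guarantee comes from the Laplace noise; the stochastic rounding is there for unbiased low-bit compression, which is the communication-saving half of the trade-off the abstract describes.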