Prior work has attempted to build private inference frameworks for transformer-based large language models (LLMs) in a server-client setting, where the server holds the model parameters and the client supplies private data for inference. However, these frameworks impose significant overhead when the private inputs are propagated forward through the original LLMs. In this paper, we show that substituting the computation- and communication-heavy operators in the transformer architecture with privacy-computing-friendly approximations can greatly reduce private inference costs with only a minor impact on model performance. Compared to the state-of-the-art Iron (NeurIPS 2022), our privacy-computing-friendly model inference pipeline achieves a $5\times$ acceleration in computation and an 80\% reduction in communication overhead, while retaining nearly identical accuracy.