The community has explored building private inference frameworks for transformer-based large language models (LLMs) in a server-client setting, where the server holds the model parameters and the client supplies its private data (or prompt) for inference. However, these frameworks impose significant overhead when private inputs are forward-propagated through the original LLMs. In this paper, we show that substituting the computation- and communication-heavy operators in the transformer architecture with privacy-computing-friendly approximations can greatly reduce private inference costs while having only a minor impact on model performance. Compared to the state-of-the-art Iron (NeurIPS 2022), our privacy-computing-friendly model inference pipeline achieves a $5\times$ acceleration in computation and an 80% reduction in communication overhead, while retaining nearly identical accuracy.