The pervasiveness of proprietary language models has raised privacy concerns over users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis of the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing the FLOPs count. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23$\times$ communication and 1.94$\times$ latency reduction. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.