The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advances in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, which stem primarily from nonlinear operations. To address this, we introduce an information-theoretic framework that characterizes the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. Using Shannon's entropy as a quantitative measure, we uncover a previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: {\em entropy collapse} in deeper layers, which destabilizes training, and {\em entropic overload} in earlier layers, which leads to under-utilization of the representational capacity of multi-head attention (MHA). We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm.
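As an illustration of the quantity the abstract relies on, the following is a minimal sketch of how the Shannon entropy of attention distributions could be measured per head. It is not taken from the authors' repository: the function name `attention_entropy`, the `(heads, queries, keys)` array layout, and the toy inputs are all assumptions made for this example.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of each attention head.

    attn: array of shape (heads, queries, keys). Each row along the last
    axis is assumed to be a probability distribution, e.g. softmax output.
    The name and layout are hypothetical, chosen for this sketch.
    """
    attn = np.asarray(attn, dtype=np.float64)
    # H(p) = -sum_j p_j * log(p_j); eps guards against log(0).
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)  # (heads, queries)
    return row_entropy.mean(axis=-1)  # (heads,)


# Two toy heads attending over 4 keys (illustrative data only).
uniform_head = np.full((3, 4), 0.25)                            # spreads attention evenly
peaked_head = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (3, 1))   # attends to one key

entropies = attention_entropy(np.stack([uniform_head, peaked_head]))
max_entropy = np.log(4)  # upper bound for a distribution over 4 keys
```

Under these assumptions, the uniform head sits at the maximum value ln 4, the regime the abstract associates with entropic overload, while the peaked head has entropy near zero, the regime it associates with entropy collapse. In a decoder-only model with causal masking, a query at position i can attend to only i + 1 keys, so its upper bound is ln(i + 1) rather than the log of the full sequence length.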