Modern cryptographic methods for implementing privacy-preserving LLMs, such as Homomorphic Encryption (HE), require the LLMs to have a polynomial form. Forming such a representation is challenging because Transformers include non-polynomial components, such as Softmax and layer normalization. Previous approaches have either directly approximated pre-trained models with high-degree polynomials, which are less efficient over HE, or replaced non-polynomial components with easier-to-approximate primitives before training, e.g., Softmax with pointwise attention. The latter approach might introduce scalability challenges. We present a new HE-friendly variant of self-attention that offers a stable form for training and is easy to approximate with polynomials for secure inference. Our work introduces the first polynomial LLMs with 32 layers and over a billion parameters, exceeding the size of previous models by more than tenfold. The resulting models demonstrate reasoning and in-context learning (ICL) capabilities comparable to standard transformers of the same size, representing a breakthrough in the field. Finally, we provide a detailed latency breakdown for each computation over encrypted data, paving the way for further optimization, and explore the differences in inductive bias between transformers relying on our HE-friendly variant and standard transformers. Our code is attached as a supplement.
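To make the Softmax/pointwise contrast concrete, the following is a minimal NumPy sketch, not the paper's actual method: standard attention uses `exp` and division (non-polynomial, so HE schemes cannot evaluate it directly), while a pointwise variant, here illustrated with a hypothetical elementwise squaring of the scores, keeps every operation polynomial.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard attention: exp and the normalizing division are
    # non-polynomial, so they require approximation under HE.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def pointwise_attention(q, k, v):
    # Illustrative pointwise variant (an assumption for this sketch,
    # not the paper's HE-friendly design): an elementwise polynomial
    # replaces the row-wise Softmax, so the whole map is polynomial
    # and can be evaluated over encrypted data.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = scores ** 2  # elementwise polynomial; no exp, no division
    return weights @ v
```

Both functions map queries, keys, and values of shape `(n, d)` to outputs of the same shape; only the first requires transcendental operations.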