Deploying large language models (LLMs) as cloud services raises privacy concerns, since inference may leak sensitive data. Fully Homomorphic Encryption (FHE) allows computation on encrypted data, but current FHE methods struggle to evaluate nonlinear functions both efficiently and precisely. Specifically, CKKS-based approaches require high-degree polynomial approximations, whose cost grows as the target precision increases. In contrast, TFHE's Programmable Bootstrapping (PBS) outperforms CKKS by offering exact lookup-table evaluation, but it lacks high-precision implementations of LLM nonlinear layers and underutilizes GPU resources. We propose \emph{TIGER}, the first GPU-accelerated framework for high-precision TFHE-based evaluation of nonlinear LLM layers. TIGER offers: (1) a GPU-optimized WoP-PBS method combined with numerical algorithms to surpass the native precision limits of lookup tables on nonlinear functions; (2) high-precision, efficient implementations of key nonlinear layers, enabling practical encrypted inference; and (3) a batch-driven design that exploits inter-input parallelism to boost GPU efficiency. TIGER achieves speedups of 7.17$\times$, 16.68$\times$, and 17.05$\times$ over a CPU baseline for GELU, Softmax, and LayerNorm, respectively.
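To make the "native lookup-table precision limit" mentioned above concrete, the following is a minimal plaintext sketch and not TIGER's method or any TFHE library API: it tabulates GELU at `2**bits` evenly spaced points and measures the worst-case error of a plain table lookup. The input range `[-8, 8)`, the function names, and the sample count are all illustrative assumptions. The point is only that a table indexed by a few bits of input cannot resolve the function more finely than its step size, which is why additional numerical techniques are needed to reach higher precision.

```python
import math


def gelu(x):
    """Exact GELU via the error function (plaintext reference)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_lut(bits, lo=-8.0, hi=8.0):
    """Tabulate GELU at 2**bits evenly spaced points on [lo, hi).

    The range [-8, 8) is an illustrative assumption, not taken from the paper.
    """
    n = 1 << bits
    step = (hi - lo) / n
    return [gelu(lo + i * step) for i in range(n)]


def lut_eval(table, x, lo=-8.0, hi=8.0):
    """Look up the entry whose cell contains x (floor quantization, clamped)."""
    n = len(table)
    step = (hi - lo) / n
    idx = min(n - 1, max(0, int((x - lo) / step)))
    return table[idx]


def max_lut_error(bits, samples=20000, lo=-8.0, hi=8.0):
    """Worst absolute error of the table lookup over a uniform sample grid."""
    table = build_lut(bits, lo, hi)
    worst = 0.0
    for k in range(samples):
        x = lo + (hi - lo) * k / samples
        worst = max(worst, abs(gelu(x) - lut_eval(table, x, lo, hi)))
    return worst


if __name__ == "__main__":
    for bits in (4, 8, 12):
        print(f"{bits:2d}-bit table: max |error| = {max_lut_error(bits):.5f}")
```

In this sketch each extra input bit roughly halves the error, while the table size doubles. In an encrypted setting the usable table width is tightly constrained, so the achievable precision of a single lookup is limited in a similar way.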