As large language models (LLMs) become ubiquitous, privacy concerns about inference inputs keep growing. In this context, fully homomorphic encryption (FHE) has emerged as a primary cryptographic solution for non-interactive confidential LLM inference. Existing solutions scale poorly with the input token length, and hence focus either on small models or on larger models with a small number of input tokens. They also suffer from the presence of large outlier values. These values strongly impact the evaluation of non-linear layers, requiring large-degree polynomial approximations and thus heavy evaluation costs. We propose an FHE-based private LLM inference solution that allows thousands of input tokens with only a part of them being encrypted: this fits a scenario where the context is benign and only part of the input is sensitive. To do so, we suggest an unbalanced chunked prefill framework that processes the private and public parts of the input tokens differently. Our framework contains plaintext-plaintext, plaintext-ciphertext and ciphertext-ciphertext computational components, and we adopt different strategies and ingredients for each component. We also devise new homomorphic algorithms for specific matrix multiplication and polynomial evaluation tasks encountered during LLM inference. Furthermore, without retraining, we tailor the LLM inference algorithm to reduce the ranges of outlier values: we leverage machine learning strategies (token prepending and rotations) to mitigate the impact of the outliers on non-linear layers. Based on these ingredients, we describe a CKKS-based end-to-end implementation of Llama-2-7B private inference for up to 4096 input tokens, of which the last 128 are encrypted. On a cluster of 8 NVIDIA RTX-4090 GPUs, inference takes 85 s for summarization and 33 s per output token for generation.
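To see why the unbalanced split pays off, the following sketch (not the paper's code; the helper name and interface are hypothetical) counts how the causal attention score entries of a prefill divide into the three computational components when the first `n_pub` tokens are public and the last `n_priv` are encrypted. Under causal masking, a public query only attends to public keys, so the vast majority of the work stays in the cheap plaintext-plaintext component.

```python
# Illustrative sketch, assuming: public tokens occupy indices
# 0..n_pub-1, private (encrypted) tokens occupy the last n_priv
# positions, and attention is causal (a query attends only to
# itself and earlier tokens).

def attention_components(n_pub: int, n_priv: int) -> dict:
    """Count causal Q*K^T score entries per computational component.

    Hypothetical helper for illustration only; returns the number of
    score entries computed plaintext-plaintext, plaintext-ciphertext
    and ciphertext-ciphertext.
    """
    counts = {"pt-pt": 0, "pt-ct": 0, "ct-ct": 0}
    n = n_pub + n_priv
    for q in range(n):            # query token index
        for k in range(q + 1):    # causal mask: keys up to the query
            q_priv = q >= n_pub
            k_priv = k >= n_pub
            if q_priv and k_priv:
                counts["ct-ct"] += 1
            elif q_priv or k_priv:
                counts["pt-ct"] += 1
            else:
                counts["pt-pt"] += 1
    return counts

# The paper's setting: 4096 input tokens, of which the last 128
# are encrypted.
c = attention_components(4096 - 128, 128)
total = sum(c.values())
for name, cnt in c.items():
    print(f"{name}: {cnt} entries ({100 * cnt / total:.2f}%)")
```

In this setting the expensive ciphertext-ciphertext block covers only about 0.1% of the score entries, which is one way to read the framework's claim that processing the public and private parts differently makes thousands of input tokens tractable.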