CryptoGen: Secure Transformer Generation with Encrypted KV-Cache Reuse

The widespread deployment of cloud-hosted generative models raises a fundamental challenge: enabling efficient autoregressive generation while preserving the privacy of both user prompts and model parameters in untrusted environments. We address this challenge in a client-server setting where an untrusted server hosts an autoregressive Transformer and the client requires cryptographic protection for both inputs and inference. We present CryptoGen, the first system to enable scalable privacy-preserving neural generation with persistent encrypted key-value (KV) cache reuse. Discriminative-task secure inference systems incur quadratic latency and memory growth when adapted to autoregressive decoding due to the lack of native encrypted KV-cache support. In contrast, CryptoGen achieves near-linear scaling by securely reusing and updating encrypted KV caches throughout generation. CryptoGen integrates homomorphic encryption and secret sharing to support both prefilling and generation. Key techniques include a unified encrypted KV-cache framework, heterogeneous SIMD encodings for different phases, optimized cipher-cipher matrix-matrix and matrix-vector operations, and efficient noise refresh and ciphertext concatenation mechanisms. Evaluation on generative Transformer models trained on WikiText-2, PTB, and LAMBADA shows that for input lengths of 128-512 tokens, CryptoGen achieves 4.4x-7.6x lower per-token latency than state-of-the-art discriminative secure inference systems, while maintaining near-linear latency and memory scaling, with advantages increasing for longer sequences. CryptoGen is released as an open-source library.
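The quadratic-versus-near-linear claim above can be illustrated with a small plaintext cost model. This is only a sketch under stated assumptions, not CryptoGen's protocol or API: the function names are hypothetical, the unit of cost is "token positions pushed through the key/value projections", and the cost of homomorphic encryption, secret sharing, and noise refresh is not modelled.

```python
def kv_work_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when every decoding step recomputes
    keys and values for the entire prefix (no KV-cache reuse)."""
    total = 0
    for step in range(new_tokens):
        # At step `step` the prefix holds the prompt plus `step` generated tokens,
        # and all of them are re-projected.
        total += prompt_len + step
    return total


def kv_work_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when keys and values are cached:
    the prompt is projected once (prefill), then one token per step."""
    total = 0
    for step in range(new_tokens):
        if step == 0:
            total += prompt_len  # prefill: project the whole prompt once
        else:
            total += 1           # generation: project only the newest token
    return total
```

Under this model, a 128-token prompt followed by 64 generated tokens costs 10208 token projections without a cache and 191 with one. The first quantity grows as prompt_len * new_tokens + new_tokens * (new_tokens - 1) / 2, the second as prompt_len + new_tokens - 1, which is the gap that a persistent encrypted KV cache is meant to close.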
