Cloud-hosted transformer and large language model (LLM) inference creates a direct confidentiality problem: user prompts may contain sensitive code, business data, personal information, or regulated documents, yet remote serving exposes intermediate state to the cloud software stack and accelerator runtime. Fully homomorphic encryption (FHE) keeps accelerator-side execution ciphertext-only, but end-to-end LLM inference remains expensive because linear layers are interleaved with non-linear, cache-state, and refresh-sensitive operators. CPU trusted execution environments (TEEs) can execute those operators natively, but a CPU TEE alone does not define how an untrusted accelerator should participate. We present Bifrost, a hybrid TEE-FHE serving architecture in which secrets are provisioned only to an attested CPU TEE, while the accelerator, device memory, driver/runtime stack, and host software remain outside the trusted computing base. Bifrost uses FHE as a secure delegation mechanism for projection and feed-forward linear layers on accelerator-backed CKKS, while non-linear operators, attention-side control logic, KV-state transitions, and decrypt-then-encrypt refresh execute inside the CPU TEE. Bifrost+ further applies a prefill/decode split: prompt-side KV state is built inside the CPU TEE, and only decode-side state enters the hybrid ciphertext path. In an estimator-style comparison matching Euston's methodology, Bifrost reduces projected latency by 9.25x on GPT-2 (1.5B) and 9.91x on LLaMA 3 (8B). In direct CKKS/FHE deployments, Bifrost+ reduces time to first token (TTFT) by 14.6-45.8x on GPT-2 (124M) and 15.3-53.4x on Qwen3 (0.6B). The systems lesson is selective encrypted execution: use FHE only where ciphertext-only accelerator delegation is required, and keep non-linear, refresh, and prompt-side work inside the CPU TEE.
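The operator split described in the abstract can be illustrated as a small placement rule. This is a minimal sketch based only on the abstract, not Bifrost's actual implementation: the operator labels, the `Placement` enum, the `place` function, and the `prefill_split` flag are all hypothetical names introduced for illustration, and treating every prefill operator as TEE-resident under Bifrost+ is one reading of "prompt-side work stays inside the CPU TEE".

```python
from enum import Enum


class Placement(Enum):
    CPU_TEE = "cpu_tee"              # plaintext inside the attested CPU TEE
    ACCEL_FHE = "accelerator_fhe"    # CKKS ciphertext on the untrusted accelerator


# Hypothetical operator labels; the abstract names only the categories.
LINEAR_OPS = {"qkv_projection", "output_projection", "ffn_up", "ffn_down"}
TEE_OPS = {
    "softmax", "layernorm", "activation",   # non-linear operators
    "attention_control", "kv_update",       # attention-side control, KV-state transitions
    "refresh",                              # decrypt-then-encrypt refresh
}


def place(op: str, phase: str, prefill_split: bool = False) -> Placement:
    """Return where an operator runs under the assumed placement policy.

    prefill_split=False models the base design (assumed: same split in both phases).
    prefill_split=True models the Bifrost+ idea: prompt-side work stays in the TEE.
    """
    if phase not in ("prefill", "decode"):
        raise ValueError(f"unknown phase: {phase}")
    if op not in LINEAR_OPS | TEE_OPS:
        raise ValueError(f"unknown operator: {op}")
    if prefill_split and phase == "prefill":
        return Placement.CPU_TEE
    if op in LINEAR_OPS:
        return Placement.ACCEL_FHE
    return Placement.CPU_TEE


if __name__ == "__main__":
    for op in ("ffn_up", "softmax", "refresh"):
        print(op, place(op, "decode").value)
```

With `prefill_split=True`, a call such as `place("ffn_up", "prefill", prefill_split=True)` returns `Placement.CPU_TEE`, so only decode-side linear layers reach the ciphertext path in this sketch.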