The computation integrity of remotely served large language models (LLMs) cannot be taken for granted. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses a Trusted Execution Environment (TEE) to compute the non-linear components and to verify the integrity of the linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (\textsc{VeriAttn}) for accelerating verifiable LLM inference. \textsc{VeriAttn} offloads both the linear and the non-linear computations of attention to the GPU, while the TEE performs verification. For prefill, \textsc{VeriAttn} uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds the available GPU memory, \textsc{VeriAttn} partitions attention across the TEE and the GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that, relative to TSDP, \textsc{VeriAttn} achieves speedups of 2.60--3.38$\times$ during prefill with 6k-token prompts and 3.86--5.42$\times$ during decoding with 10k-token outputs.
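The abstract does not say which scheme the TEE uses to verify linear components offloaded to the GPU. As an illustration only, the sketch below shows Freivalds' check, a well-known probabilistic test that confirms a claimed matrix product with matrix-vector products instead of recomputing it. The function names (`matvec`, `matmul`, `freivalds_check`) are hypothetical, and exact integer arithmetic is assumed here for simplicity; a real system would have to handle floating-point or finite-field arithmetic.

```python
import random


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def matmul(a, b):
    """Reference matrix product, standing in for the untrusted accelerator."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def freivalds_check(a, b, c, rounds=20, rng=None):
    """Probabilistically check that a @ b == c (assumed illustration, not the paper's method).

    Each round costs O(n^2) work instead of the O(n^3) needed to recompute
    a @ b. An incorrect c passes a single round with probability at most 1/2,
    so it passes all rounds with probability at most 2 ** -rounds.
    """
    rng = rng or random.Random()
    n_cols = len(c[0])
    for _ in range(rounds):
        # Random 0/1 challenge vector chosen by the verifier.
        r = [rng.randint(0, 1) for _ in range(n_cols)]
        # Compare a @ (b @ r) against c @ r using only matrix-vector products.
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False
    return True


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    honest = matmul(a, b)
    tampered = [row[:] for row in honest]
    tampered[1][1] += 1  # simulate a corrupted result
    print(freivalds_check(a, b, honest))
    print(freivalds_check(a, b, tampered, rounds=40))
```

The verifier's cost per round grows quadratically with the matrix dimension, which is why a check of this kind is attractive when the verifier is much slower than the device doing the multiplication.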