We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split in which the embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application of lookahead decoding to split inference over WANs, amortizing network round-trip latency across multiple tokens per iteration; (3) an empirical inversion-attack evaluation showing that split depth provides a tunable privacy-performance tradeoff: an attacker can recover ~59% of tokens at a 2-layer split but only ~35% at an 8-layer split, with minimal throughput impact; (4) ablation experiments showing that n-gram speculation accepts 1.2-1.3 tokens per decoding step on average (peak of 7 observed on code), with acceptance rates consistent across model scales; (5) formal verification that lookahead decoding produces token-identical output to sequential decoding under greedy argmax, with zero quality degradation; and (6) scaling validation on Mistral NeMo 12B (40 layers), demonstrating that the system generalizes to larger models with only 4.9 GB of local VRAM while matching 7B throughput. Evaluated on Mistral 7B and NeMo 12B over a ~80 ms WAN link, our system achieves 8.7-9.3 tok/s (7B) and 7.8-8.7 tok/s (12B) with lookahead decoding, and an RTT decomposition model (validated at <6.2% cross-validation error) projects 15-19 tok/s at 20 ms RTT.
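The RTT-amortization claim above can be sketched as a simple per-step latency model: each lookahead decoding step costs one network round trip plus compute, and the tokens accepted per step amortize that cost. The following is a minimal illustrative sketch, not the paper's actual model; the per-step compute constant is back-solved from the reported ~9 tok/s at 80 ms RTT with ~1.25 tokens accepted per step, and is an assumption rather than a measured value.

```python
def projected_throughput(rtt_s: float, tokens_per_step: float,
                         compute_s: float) -> float:
    """Tokens/second when each decoding step costs one RTT plus compute.

    Lookahead decoding accepts `tokens_per_step` tokens per round trip,
    so throughput = tokens_per_step / (rtt_s + compute_s).
    """
    return tokens_per_step / (rtt_s + compute_s)


# Assumption: back-solve per-step compute from the reported 7B numbers
# (~9 tok/s at 80 ms RTT, ~1.25 tokens accepted per step):
# 1.25 / 9 - 0.080 ≈ 59 ms of compute per step.
compute = 1.25 / 9.0 - 0.080

print(f"80 ms RTT: {projected_throughput(0.080, 1.25, compute):.1f} tok/s")
print(f"20 ms RTT: {projected_throughput(0.020, 1.25, compute):.1f} tok/s")
```

Under these assumed constants the model reproduces the measured ~9 tok/s at 80 ms and lands inside the abstract's projected 15-19 tok/s range at 20 ms, which is the shape of tradeoff the RTT decomposition captures.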