While distributed device-edge speculative decoding enhances resource utilization across heterogeneous nodes, its performance is often bottlenecked by conventional token-level verification strategies. Such rigid alignment leads to excessive rejections, which shorten the accepted sequence length and increase the number of interaction rounds under fluctuating wireless conditions. In this paper, we propose WISV (Wireless-Informed Semantic Verification), a novel distributed speculative decoding framework that goes beyond strict token-level matching via a channel-aware semantic acceptance policy. WISV integrates a lightweight decision head into the edge-side target LLM to dynamically evaluate speculative tokens by combining high-dimensional hidden representations with instantaneous channel state information (CSI). To optimize the trade-off between verification fidelity and communication overhead, we further design two tailored communication protocols: full-hidden upload and mismatch-first selective-hidden upload. Extensive simulations using a 1B drafter and an 8B target model demonstrate that, across the tested settings, WISV achieves up to a 60.8% increase in accepted length, a 37.3% reduction in interaction rounds, and a 31.4% reduction in end-to-end latency compared to vanilla speculative decoding, while keeping the drop in task accuracy below 1%. Finally, we validate WISV on a hardware testbed comprising an NVIDIA Jetson AGX Orin and an A40-equipped server, confirming its real-world efficacy in accelerating edge-deployed LLM inference.
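To illustrate the difference between strict token-level verification and a relaxed, channel-aware acceptance rule, the following is a minimal sketch and not the WISV implementation. The names `decision_head` and `accepted_length`, the linear-plus-sigmoid form of the head, the use of SNR in dB as the CSI feature, the threshold of 0.5, and all weights are assumptions made for illustration only; the abstract does not specify the head architecture or how CSI influences the decision.

```python
import math

def decision_head(hidden, snr_db, weights, csi_weight, bias):
    """Hypothetical lightweight head: sigmoid(w . h + w_csi * snr_db + b).

    `hidden` stands in for a hidden representation and `snr_db` for the
    instantaneous CSI; both the feature choice and the linear form are
    assumptions, not taken from the paper.
    """
    z = sum(w * h for w, h in zip(weights, hidden)) + csi_weight * snr_db + bias
    return 1.0 / (1.0 + math.exp(-z))

def accepted_length(draft, target, scores=None, threshold=0.5):
    """Count the accepted prefix of the drafted tokens.

    With scores=None, a draft token is accepted only if it equals the
    target token (strict token-level matching). With scores given, a
    mismatched token is also accepted when its score reaches the threshold.
    Verification stops at the first rejected position.
    """
    n = 0
    for i, (d, t) in enumerate(zip(draft, target)):
        if d == t:
            n += 1
        elif scores is not None and scores[i] >= threshold:
            n += 1
        else:
            break
    return n

# Toy token IDs: the draft disagrees with the target only at position 1.
draft = [5, 9, 2, 7]
target = [5, 8, 2, 7]

strict = accepted_length(draft, target)
# Hypothetical head outputs; only the score at the mismatch matters here.
relaxed = accepted_length(draft, target, scores=[0.0, 0.9, 0.0, 0.0])
```

In this toy case, strict matching stops at the first mismatch, whereas the relaxed rule lets the remaining matching tokens through. This is the mechanism by which a semantic acceptance policy can lengthen the accepted sequence and reduce interaction rounds.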