Autoregressive decoding remains a primary bottleneck in large language model (LLM) serving, motivating speculative decoding methods that reduce expensive teacher-model invocations by verifying multiple candidate tokens per step. Tree-structured speculation further increases parallelism, but it is often brittle when ported across heterogeneous backends and accelerator stacks, where attention masking, KV-cache layouts, and indexing semantics are not interchangeable. We present EAGLE-Pangu, a reproducible system that ports EAGLE-3-style tree speculative decoding to a Pangu teacher backend on Ascend NPUs. EAGLE-Pangu contributes (i) an explicit branch/commit cache manager built on the Cache API, (ii) accelerator-safe tree tensorization that eliminates undefined negative indices by construction and validates structural invariants, and (iii) a fused-kernel-compatible teacher verification path with a debuggable eager fallback. On 240 turns drawn from MT-Bench and HumanEval-style prompts, EAGLE-Pangu improves end-to-end decoding throughput by 1.27x on average (up to 2.46x at p99) over teacher-only greedy decoding in the fused-kernel performance path. We also provide a fused-kernel-free reference path with structured traces and invariant checks to support reproducible debugging and ablation across execution modes and tree budgets.