Speculative decoding (SD) reduces the high inference cost of LLMs by having a lightweight drafter generate candidate tokens that a large verifier validates in parallel. Existing draft-verify methods make a binary decision: accept a draft token or fully recompute it with the verifier. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, without invoking the full verifier. This motivates a slim-verifier that handles tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework built on a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for low-confidence cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks and does not require modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/
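The three-tier processing of draft tokens described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the thresholds `high` and `low`, the `DraftToken` type, the use of a scalar drafter confidence as the routing signal, and the verifier callables are all hypothetical names chosen for illustration. The sketch also walks tokens one by one, whereas SD verifies candidates in parallel and typically discards drafts that follow a rejected position.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class DraftToken:
    token: str
    confidence: float  # assumed routing signal, e.g. the drafter's probability


# A verifier takes the accepted prefix and the draft token and returns the
# token to emit. Both verifiers are stand-ins supplied by the caller.
Verifier = Callable[[List[str], str], str]


def route_draft_tokens(
    drafts: Sequence[DraftToken],
    slim_verifier: Verifier,
    full_verifier: Verifier,
    high: float = 0.9,  # hypothetical threshold for direct acceptance
    low: float = 0.5,   # hypothetical threshold for slim-verifier routing
) -> Tuple[List[str], List[str]]:
    """Return the emitted tokens and the tier that handled each one."""
    output: List[str] = []
    tiers: List[str] = []
    for draft in drafts:
        if draft.confidence >= high:
            # Tier 1: high confidence, accept the draft token as-is.
            output.append(draft.token)
            tiers.append("accept")
        elif draft.confidence >= low:
            # Tier 2: medium confidence, the slim submodel regenerates it.
            output.append(slim_verifier(list(output), draft.token))
            tiers.append("slim")
        else:
            # Tier 3: low confidence, fall back to the full verifier.
            output.append(full_verifier(list(output), draft.token))
            tiers.append("full")
    return output, tiers


if __name__ == "__main__":
    drafts = [DraftToken("the", 0.95), DraftToken("cat", 0.70), DraftToken("sat", 0.20)]
    tokens, tiers = route_draft_tokens(
        drafts,
        slim_verifier=lambda prefix, tok: tok.upper(),  # toy stand-in
        full_verifier=lambda prefix, tok: "<full>",     # toy stand-in
    )
    print(tokens, tiers)
```

In this toy setup only the third token reaches the full verifier, which is the effect the multi-tier design aims for: fewer expensive large-model calls for tokens that a cheaper tier can resolve.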