Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model's output distribution, supporting low-latency multimedia applications.
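The FIT-based pruning idea in the abstract can be sketched in a few lines: score each layer by the trace of its (diagonal) empirical Fisher information, which reduces to a sum of squared parameter gradients over a calibration batch, then drop the lowest-scoring layers to form the draft. This is a minimal illustrative sketch, not the paper's actual procedure; the function names, the toy random "gradients", and the 0.75 keep ratio are all assumptions introduced here.

```python
import random

def fisher_trace_scores(layer_grads):
    # Empirical Fisher Information Trace per layer: the trace of the
    # diagonal empirical Fisher is the sum of squared parameter gradients.
    return [sum(g * g for g in grads) for grads in layer_grads]

def prune_layers(num_layers, scores, keep_ratio=0.75):
    # Keep the highest-scoring (most sensitive) layers and drop the rest,
    # preserving the original layer order so the draft stays compatible
    # with the target model's verification pass.
    keep = max(1, int(round(num_layers * keep_ratio)))
    ranked = sorted(range(num_layers), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Toy example: 8 layers with synthetic per-parameter "gradients"
# standing in for gradients collected on a small calibration batch.
random.seed(0)
grads = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
scores = fisher_trace_scores(grads)
kept = prune_layers(len(grads), scores, keep_ratio=0.75)
print(kept)  # indices of the layers retained in the draft model
```

In a real deployment the retained layers would be copied (not retrained) from the target LLM, so the draft requires no separate training or maintenance, matching the training-free claim above.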