Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities, but their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from severe accuracy degradation. In this paper, we argue that this failure stems from stage-agnostic pruning, which overlooks the asymmetric roles of the prefill and decode stages. Using a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to a 1.37$\times$ speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off of existing structured pruning methods.