Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods apply fixed pruning decisions throughout inference, overlooking the sparsity patterns that emerge during autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions: the prefilling stage defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained mask provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing such as offline calibration, retraining, or learned predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring lower computational overhead and inference latency.
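To make the coarse-partition / fine-mask mechanism concrete, the sketch below shows one possible PyTorch realization under assumptions we introduce for illustration: prefill-time importance scores split channels into retained, candidate, and pruned sets once, and each decoding step rebuilds a mask only over the candidate set. The function names, thresholds, and scoring interface are hypothetical and not the paper's actual implementation.

```python
import torch

# Minimal sketch of partition-guided online pruning (illustrative, not the paper's code).

def partition_channels(prefill_scores: torch.Tensor, keep_ratio: float, prune_ratio: float):
    """Split channel indices into retained / candidate / pruned sets from prefill-time scores."""
    n = prefill_scores.numel()
    order = torch.argsort(prefill_scores, descending=True)
    n_keep = int(keep_ratio * n)
    n_prune = int(prune_ratio * n)
    retained = order[:n_keep]              # consistently important: always kept
    pruned = order[n - n_prune:]           # consistently unimportant: always dropped
    candidate = order[n_keep:n - n_prune]  # re-decided at each decoding step
    return retained, candidate, pruned

def fine_mask(decode_scores: torch.Tensor, retained: torch.Tensor,
              candidate: torch.Tensor, target_keep: int) -> torch.Tensor:
    """Per-step channel mask: retained channels plus the top candidates by current context score."""
    mask = torch.zeros_like(decode_scores, dtype=torch.bool)
    mask[retained] = True
    extra = target_keep - retained.numel()
    if extra > 0 and candidate.numel() > 0:
        top = torch.topk(decode_scores[candidate], min(extra, candidate.numel())).indices
        mask[candidate[top]] = True
    return mask
```

In this sketch, the expensive full-channel scoring happens only once at prefill; the decoding-time work is restricted to scoring and ranking the (much smaller) candidate set, which is one way the abstract's claim of minimal per-step overhead could be realized.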