As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
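The layer-wise sensitivity profile described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the per-channel form of the Wanda-style metric, and the equal-weight sum of the two scores are all assumptions for the sketch.

```python
import numpy as np

def layer_sensitivity(weights, activations, grads):
    """Per-layer sensitivity scores, z-score normalized across layers.

    weights / grads:  list of (out_dim, in_dim) arrays, one per layer
    activations:      list of (n_samples, in_dim) input activations per layer
    The equal-weight combination of the two metrics is an illustrative choice.
    """
    raw = []
    for W, X, G in zip(weights, activations, grads):
        # Wanda-style metric: |W_ij| * ||X_j||_2 (per-input-channel activation norm)
        wanda = np.abs(W) * np.linalg.norm(X, axis=0)
        # First-order gradient importance: |W_ij * dL/dW_ij|
        grad_imp = np.abs(W * G)
        raw.append(wanda.mean() + grad_imp.mean())
    raw = np.asarray(raw)
    # z-scores make the profile comparable across models of different scales
    return (raw - raw.mean()) / (raw.std() + 1e-8)
```

In this form, the agent receives one z-score per layer; layers with high scores are treated as sensitive and deprioritized for pruning in the current iteration.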