Structured width pruning of the GLU-MLP layers in Llama-3.2 models, guided by the Peak-to-Peak Magnitude (PPM) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. Performance on tasks that rely on parametric knowledge (e.g., MMLU, GSM8K) and on perplexity metrics degrades predictably as the expansion ratio decreases. In contrast, instruction-following improves at the 2.4x equilibrium ratio (IFEval: +4.8 points / +46% in Llama-3.2-1B and +3.7 points / +39% in Llama-3.2-3B), and multi-step reasoning (MUSR) remains robust. This pattern, observed consistently across both evaluated model sizes, challenges the prevailing assumption in compression research that pruning induces uniform degradation. To investigate it, we evaluated seven expansion ratio configurations on benchmark suites covering factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively reshapes the model's task performance profile, rather than merely serving as a compression metric.
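To make the notion of pruning to a target expansion ratio concrete, the following is a minimal sketch rather than the procedure used in this work. It assumes that the expansion ratio is the MLP intermediate width divided by the hidden size, and that a neuron's importance is the peak-to-peak range (maximum minus minimum) of its gate and up projection weights. That scoring rule is our reading of the name "Peak-to-Peak Magnitude", not a definition taken from the text. The function names `target_width`, `peak_to_peak`, and `prune_glu_mlp` are hypothetical, as are the toy weights and the hidden size of 2048 used in the arithmetic.

```python
def target_width(hidden_size: int, expansion_ratio: float) -> int:
    """Intermediate (MLP) width implied by a given expansion ratio."""
    return round(hidden_size * expansion_ratio)


def peak_to_peak(row):
    """Range of a weight vector: largest value minus smallest value."""
    return max(row) - min(row)


def prune_glu_mlp(gate, up, down, keep):
    """Keep the `keep` highest-scoring neurons of one GLU-MLP block.

    gate, up: one row per intermediate neuron, each row of length `hidden`.
    down:     one row per hidden unit, each row of length `intermediate`.
    """
    # ASSUMPTION: score = peak-to-peak of the gate row plus that of the up row.
    scores = [peak_to_peak(g) + peak_to_peak(u) for g, u in zip(gate, up)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order

    # Structured pruning: drop whole neurons, i.e. matching rows of gate/up
    # and the corresponding columns of down, so the shapes stay consistent.
    new_gate = [gate[i] for i in kept]
    new_up = [up[i] for i in kept]
    new_down = [[row[i] for i in kept] for row in down]
    return new_gate, new_up, new_down, kept


# Toy block: hidden size 2, intermediate width 4 (expansion ratio 2.0).
gate = [[0.9, -0.8], [0.1, 0.0], [0.5, -0.5], [0.05, 0.02]]
up = [[0.3, -0.2], [0.0, 0.1], [0.4, -0.4], [0.01, 0.0]]
down = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

# Prune to expansion ratio 1.0, i.e. two neurons for hidden size 2.
keep = target_width(hidden_size=2, expansion_ratio=1.0)
new_gate, new_up, new_down, kept = prune_glu_mlp(gate, up, down, keep)
print(kept)      # indices of the surviving neurons
print(new_down)  # down projection restricted to those neurons

# Illustrative arithmetic: width implied by a 2.4x ratio at hidden size 2048.
print(target_width(2048, 2.4))
```

Because the gate, up, and down projections are sliced with the same neuron indices, the pruned block remains a valid, smaller GLU-MLP. A real implementation would operate on framework tensors across all layers and might round the target width to a hardware-friendly multiple.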