Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0–60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features (those with low firing rates) survive pruning far better than frequent ones, with within-condition Spearman correlations of rho = -1.0 in 11 of 17 experimental conditions. This counterintuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. We further show that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, that pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance, a dissociation with implications for interpretability under compression.