Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. Low-rank weight matrix compression is a popular strategy for reducing model size, typically by minimizing weight reconstruction error under the assumption that weight matrices are low-rank. However, this assumption often does not hold in LLMs. Instead, LLM activations exhibit stronger low-rank structure, prompting a shift toward minimizing activation reconstruction error. We show that this shift alone is insufficient: activation dimensions contribute unequally to model performance, and reconstructing them uniformly can harm accuracy. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links compression decisions to their effect on model behavior. IMPACT formulates an optimization problem that accounts for both activation structure and gradient sensitivity, and derives a closed-form solution in which the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. This yields low-rank approximations explicitly optimized to preserve accuracy. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction while maintaining accuracy comparable to state-of-the-art baselines.
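To make the closed-form claim concrete, the following is a minimal illustrative derivation of one way an importance-weighted activation reconstruction objective yields eigenvectors of a weighted covariance as the optimal bases. The notation ($X$, $D$, $V_r$) and the specific objective are assumptions made here for illustration, not necessarily the exact formulation used by IMPACT.

\[
\min_{\operatorname{rank}(\hat{X}) \le r} \bigl\| (X - \hat{X})\, D^{1/2} \bigr\|_F^2 ,
\qquad X \in \mathbb{R}^{n \times d},\quad D = \operatorname{diag}(w_1, \dots, w_d) \succ 0,
\]
where the rows of $X$ are sampled activations and $D$ holds per-dimension importance scores (e.g., derived from gradient sensitivity). Substituting $Y = X D^{1/2}$ turns this into an unweighted rank-$r$ approximation of $Y$, so by the Eckart--Young theorem the optimum is
\[
\hat{X} = X\, D^{1/2} V_r V_r^{\top} D^{-1/2},
\qquad
S = D^{1/2} X^{\top} X\, D^{1/2} = V \Lambda V^{\top},
\]
with $V_r$ the top-$r$ eigenvectors of $S$, an importance-weighted activation covariance matrix (up to a $1/n$ factor). Under this toy formulation, setting $D = I$ recovers ordinary activation-reconstruction PCA, which illustrates why uniform treatment of dimensions ignores their unequal contribution to model performance.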