Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. Low-rank weight-matrix compression is a popular strategy for reducing model size, typically by minimizing weight reconstruction error under the assumption that the weights are low-rank. However, this assumption often does not hold in LLMs. Instead, LLM activations exhibit stronger low-rank structure, prompting a shift toward minimizing activation reconstruction error. We show that this shift alone is insufficient: activation dimensions contribute unequally to model performance, and reconstructing them uniformly can harm accuracy. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links compression decisions to their effect on model behavior. IMPACT formulates an optimization problem that accounts for both activation structure and gradient sensitivity, and derives a closed-form solution in which the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. This yields low-rank approximations explicitly optimized to preserve accuracy. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction with accuracy comparable to state-of-the-art baselines.
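To make the closed-form idea concrete, the following is a minimal NumPy sketch of one plausible instantiation, not the paper's actual method: activations are weighted by a per-sample importance score (which in practice might be derived from gradient sensitivity), the weighted covariance is eigendecomposed, and the top eigenvectors serve as the low-rank reconstruction basis. All function names and the exact weighting scheme here are illustrative assumptions.

```python
import numpy as np

def importance_weighted_bases(X, w, rank):
    """Hypothetical sketch: top-`rank` eigenvectors of an
    importance-weighted activation covariance matrix.

    X : (n_samples, d) activation matrix
    w : (n_samples,) nonnegative importance weights
        (e.g., could be derived from gradient magnitudes)
    """
    # Importance-weighted covariance: C = X^T diag(w) X
    C = X.T @ (w[:, None] * X)
    # eigh returns eigenvalues in ascending order for symmetric C
    eigvals, eigvecs = np.linalg.eigh(C)
    # Keep the eigenvectors with the largest eigenvalues
    return eigvecs[:, -rank:]

def low_rank_reconstruct(X, V):
    """Project activations onto the basis V and back (rank-r approximation)."""
    return X @ V @ V.T
```

With `rank` equal to the full dimension the reconstruction is exact; shrinking `rank` trades accuracy for compression, with the weighting biasing the retained subspace toward directions the importance scores mark as performance-critical.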