In recent years, large language models (LLMs) have driven advances in natural language processing, but their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, removes non-essential parameters by decomposing weight matrices into products of two low-rank matrices; yet its application to LLMs has not been extensively studied. The key to low-rank compression lies in low-rank factorization and low-rank dimension allocation. To address the challenges of low-rank compression in LLMs, we conduct an empirical study of the low-rank characteristics of large models and propose a low-rank compression method tailored to LLMs. The method estimates feature distributions precisely via pooled covariance matrices and allocates low-rank dimensions with a Bayesian optimization strategy. Experiments on the LLaMA-2 models demonstrate that, at the same compression ratio, our method outperforms existing strong structured pruning and low-rank compression techniques in maintaining model performance.
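To make the factorization step concrete, the sketch below shows a data-aware low-rank decomposition in NumPy. The whitening-by-Cholesky construction and all names here are illustrative assumptions rather than the paper's exact algorithm; the point is how a pooled input covariance can shape the SVD truncation of a weight matrix so that W ≈ AB with far fewer parameters.

```python
import numpy as np

def low_rank_factorize(W, X, rank):
    """Data-aware low-rank factorization of a weight matrix W (out x in).

    X holds calibration activations (n_samples x in_features) used to
    estimate the input feature covariance. Illustrative sketch only.
    """
    # Covariance of the input features (one batch here; pooled
    # statistics across calibration data in the described setting).
    cov = X.T @ X / X.shape[0]
    # Cholesky factor S with cov = S @ S.T (jitter for stability).
    S = np.linalg.cholesky(cov + 1e-6 * np.eye(cov.shape[0]))
    # SVD of the "whitened" weight, so truncation error is measured in
    # the geometry induced by the activation distribution.
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    # Keep the top-`rank` components and fold the whitening back out.
    A = U[:, :rank] * sigma[:rank]      # out_features x rank
    B = Vt[:rank] @ np.linalg.inv(S)    # rank x in_features
    return A, B                         # W is approximated by A @ B

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096))
X = rng.standard_normal((2048, 4096))
A, B = low_rank_factorize(W, X, rank=256)
# Parameter count drops from 1024*4096 to 256*(1024 + 4096).
```

Replacing the dense layer with the two factors is what yields the compression: a rank-r factorization of an m x n matrix costs r(m + n) parameters instead of mn.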
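The rank-allocation search can likewise be sketched with an off-the-shelf Bayesian optimizer. The snippet below uses scikit-optimize's gp_minimize purely for illustration; the layer count, parameter budget, and the toy objective standing in for calibration perplexity are all assumptions, since the paper's actual search space and evaluation signal are not given here.

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Integer

D = 4096          # hidden size (LLaMA-2-7B-like, for illustration)
N_LAYERS = 8      # kept small so the sketch runs quickly
BUDGET = 0.5 * N_LAYERS * D * D   # target: 50% of dense parameters

# One low-rank dimension per layer, searched jointly.
space = [Integer(64, 1024) for _ in range(N_LAYERS)]

def factored_params(ranks):
    # A rank-r factorization of a D x D matrix costs 2*D*r parameters.
    return sum(2 * D * r for r in ranks)

def objective(ranks):
    # Stand-in for the real signal: factorize the model at these ranks
    # and measure calibration perplexity. A synthetic proxy is used
    # here so the sketch stays self-contained.
    if factored_params(ranks) > BUDGET:
        return 1e9                      # reject over-budget allocations
    return float(np.sum(1.0 / np.asarray(ranks)))  # toy "loss"

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best per-layer ranks:", result.x)
```

The Gaussian-process surrogate lets the search spend its limited evaluation budget on promising allocations, which matters when each evaluation requires compressing and scoring a full model.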