Neural networks are often challenging to work with due to their large size and complexity. To address this, various methods aim to reduce model size by sparsifying or decomposing weight matrices, such as magnitude pruning and low-rank or block-diagonal factorization. In this work, we present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices. Although solving this problem exactly is computationally infeasible, we propose an efficient heuristic based on alternating minimization via ADMM that achieves state-of-the-art results, enabling unprecedented sparsification of neural networks. For instance, in a one-shot pruning setting, our method can reduce the size of the LLaMA2-13B model by 50% while maintaining better performance than the dense LLaMA2-7B model. We also compare favorably with Optimal Brain Compression, the state-of-the-art layer-wise pruning approach for convolutional neural networks. Furthermore, accuracy improvements of our method persist even after further model fine-tuning. Code available at: https://github.com/usamec/double_sparse.
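The factorization described above can be sketched as follows. This is a simplified illustration only: it uses plain alternating least squares with magnitude-based projection onto a fixed per-factor density, not the paper's ADMM-based scheme, and all function names are hypothetical.

```python
import numpy as np

def sparsify(M, density):
    """Keep the largest-magnitude entries of M (a `density` fraction), zero the rest."""
    k = max(1, int(density * M.size))
    thresh = np.partition(np.abs(M).ravel(), -k)[-k]
    return np.where(np.abs(M) >= thresh, M, 0.0)

def double_sparse_factorize(W, inner_dim, density, iters=50, seed=0):
    """Approximate W ~= A @ B with both factors sparse.

    Alternates between solving a least-squares problem for one factor
    (holding the other fixed) and projecting the result back onto the
    set of matrices with the target density. Illustrative heuristic only;
    the paper solves this with an ADMM-based alternating minimization.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((W.shape[0], inner_dim)) / np.sqrt(inner_dim)
    for _ in range(iters):
        # Solve min_B ||W - A B||_F, then project B to the sparse set.
        B = sparsify(np.linalg.lstsq(A, W, rcond=None)[0], density)
        # Solve min_A ||W - A B||_F (via the transposed system), then project A.
        A = sparsify(np.linalg.lstsq(B.T, W.T, rcond=None)[0].T, density)
    return A, B

# Toy usage: factorize a random 64x64 matrix into two 30%-dense factors.
W = np.random.default_rng(1).standard_normal((64, 64))
A, B = double_sparse_factorize(W, inner_dim=64, density=0.3)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

Note that with two factors at 30% density each, the factorization stores fewer nonzeros than a single 60%-sparse matrix of the same shape, while composing to a richer approximation than either factor alone; this is the trade-off the paper exploits.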