Transformer models have achieved remarkable results on a wide range of natural language tasks, but they are often prohibitively large and require substantial memory and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of low-rank approximation and pruning while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts of neurons, while pruning removes the incoherent and non-expressive parts. Pruning enhances the diversity of the low-rank approximation, and low-rank approximation prevents pruning from discarding too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks, and show that it significantly outperforms existing compression methods.
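To make the decomposition concrete, the following is a minimal NumPy sketch of approximating a weight matrix W by a low-rank product U V plus a sparse matrix S. It is an illustration under stated assumptions, not the procedure used by LoSparse: the abstract does not specify how the two parts are obtained, so the sketch assumes a one-shot truncated SVD for the low-rank part and simple magnitude-based selection of residual entries for the sparse part. The function name `low_rank_plus_sparse` and the parameters `rank` and `keep_fraction` are hypothetical.

```python
import numpy as np


def low_rank_plus_sparse(W, rank, keep_fraction):
    """Approximate W as U @ V + S.

    Hypothetical helper for illustration only: truncated SVD plus
    magnitude-based residual selection are assumptions of this sketch,
    not the method described in the abstract.
    """
    # Truncated SVD: best rank-`rank` approximation in Frobenius norm.
    P, sigma, Qt = np.linalg.svd(W, full_matrices=False)
    U = P[:, :rank] * sigma[:rank]  # shape (m, rank), singular values folded in
    V = Qt[:rank, :]                # shape (rank, n)

    # Residual that the low-rank part does not capture.
    R = W - U @ V

    # Keep only the k largest-magnitude residual entries; zero out the rest.
    k = int(round(keep_fraction * R.size))
    k = max(0, min(k, R.size))
    S = np.zeros(R.shape, dtype=R.dtype)
    if k > 0:
        idx = np.argpartition(np.abs(R).ravel(), -k)[-k:]
        S.flat[idx] = R.flat[idx]
    return U, V, S


# Small demonstration on a random matrix (synthetic data, not a real model).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
U, V, S = low_rank_plus_sparse(W, rank=8, keep_fraction=0.05)

err_low_rank = np.linalg.norm(W - U @ V)        # low-rank part alone
err_combined = np.linalg.norm(W - (U @ V + S))  # low-rank plus sparse
```

In this sketch the combined approximation cannot have a larger Frobenius error than the low-rank part alone, because S cancels selected residual entries exactly. The compressed form stores the entries of U and V plus the nonzeros of S (and their positions) instead of the full matrix W.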