This paper introduces SparseOptimizer, a deep learning optimizer that uses Moreau-Yosida regularization to induce sparsity in large language models such as BERT, ALBERT, and GPT. At the core of SparseOptimizer is an embedded shrinkage operator that imposes sparsity directly within the optimization process. The operator rests on a sound theoretical framework and admits an analytical solution, which supports the optimizer's robustness and efficacy. SparseOptimizer is plug-and-play: it requires no changes to model code, so it can be applied to a wide range of large language models. Empirical evaluations on the GLUE, RACE, SQuAD1, and SQuAD2 benchmarks show that SparseBERT and SparseALBERT, the models obtained by sparsifying BERT and ALBERT with SparseOptimizer, achieve performance comparable to their dense counterparts while significantly reducing the parameter count. This work also proposes an optimizer-compiler co-design strategy and demonstrates the potential for inference acceleration in SparseBERT when it is paired with an appropriately designed compiler (\textbf{3.37x}, \textbf{6.30x}, and \textbf{7.15x} compared with PyTorch, TensorFlow, and generic LLVM compilation, respectively). This study is a step toward efficient, scalable, and high-performing large language models and sets a precedent for future work on optimization in this domain. The SparseOptimizer code and the SparseALBERT model will be made available upon paper acceptance.
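To illustrate what a shrinkage operator with an analytical solution can look like, here is a minimal sketch that assumes an L1 penalty, for which the proximal map has the well-known closed form called soft-thresholding. The abstract does not state the exact penalty or update rule, so this is not the authors' implementation; the names `soft_threshold`, `sparse_sgd_step`, `lr`, and `lam` are hypothetical and chosen only for this example.

```python
from typing import List


def soft_threshold(w: float, tau: float) -> float:
    """Closed-form proximal map of tau * |w| (assumed L1 penalty).

    Moves w toward zero by tau and returns exactly 0.0 when |w| <= tau.
    """
    if w > tau:
        return w - tau
    if w < -tau:
        return w + tau
    return 0.0


def sparse_sgd_step(weights: List[float], grads: List[float],
                    lr: float, lam: float) -> List[float]:
    """One proximal-gradient step: a plain SGD update followed by shrinkage.

    lr is the learning rate and lam the sparsity strength (both hypothetical
    names); the shrinkage threshold is lr * lam.
    """
    return [soft_threshold(w - lr * g, lr * lam)
            for w, g in zip(weights, grads)]


# Toy example: small weights are driven exactly to zero, large ones shrink.
weights = [0.5, -0.25, 1.0]
grads = [1.0, -1.0, -2.0]
updated = sparse_sgd_step(weights, grads, lr=0.25, lam=1.0)
print(updated)
```

Because the shrinkage happens inside the update step, the model code itself is unchanged, which is the sense in which such an optimizer can be plug-and-play. Weights whose magnitude falls at or below the threshold become exactly zero rather than merely small.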