This paper introduces SparseOptimizer, a deep learning optimizer that uses Moreau-Yosida regularization to induce sparsity in large language models such as BERT, ALBERT, and GPT. At the core of SparseOptimizer is an embedded shrinkage operator that imposes sparsity directly within the optimization process. The operator rests on a sound theoretical framework and admits an analytical solution, which supports the optimizer's robustness and efficacy. SparseOptimizer is plug-and-play: it requires no modifications to model code, so it can be applied to a wide range of large language models. Empirical evaluations on the GLUE, RACE, SQuAD1, and SQuAD2 benchmarks show that SparseBERT and SparseALBERT, the models obtained by sparsifying BERT and ALBERT with SparseOptimizer, achieve performance comparable to their dense counterparts while significantly reducing parameter count. This work also proposes an optimizer-compiler co-design strategy and demonstrates the potential for inference acceleration in SparseBERT when paired with an appropriately designed compiler (\textbf{3.37x}, \textbf{6.30x}, and \textbf{7.15x} compared with PyTorch, TensorFlow, and generic LLVM compilation, respectively). The study is a step toward efficient, scalable, and high-performing large language models. The SparseOptimizer code and SparseALBERT model will be made publicly available upon paper acceptance.
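To make the idea of a shrinkage operator with an analytical solution concrete, the following is a minimal sketch and not the authors' implementation. It assumes the sparsity penalty is an L1 term, whose proximal (Moreau-Yosida) step has the well-known closed form called soft-thresholding. The abstract does not state the exact penalty, and the names `soft_threshold`, `sparse_sgd_step`, `lr`, and `lam` are hypothetical, chosen only for illustration.

```python
def soft_threshold(w, lam):
    """Closed-form proximal operator of lam * |w| (assumed L1 penalty).

    Values with magnitude at most lam become exactly zero; larger values
    are pulled toward zero by lam.
    """
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def sparse_sgd_step(weights, grads, lr=0.1, lam=0.01):
    """One plain gradient step followed by element-wise shrinkage.

    This is a generic proximal-gradient update, used here as a stand-in
    for an optimizer with an embedded shrinkage operator.
    """
    return [soft_threshold(w - lr * g, lr * lam) for w, g in zip(weights, grads)]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


if __name__ == "__main__":
    weights = [1.0, 0.25, -0.5, 2.0]
    grads = [0.0, 0.0, 0.0, 0.0]
    updated = sparse_sgd_step(weights, grads, lr=0.5, lam=1.0)
    print(updated, sparsity(updated))
```

Because the shrinkage is applied inside the update step, sparsity appears as a by-product of training rather than through a separate pruning pass, which is the sense in which such an optimizer can be swapped in without touching model code.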