While Large Language Models (LLMs) have achieved remarkable success across many fields, the efficiency of their training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an innovative architecture that extends the core decoder-only framework with subsampling, upsampling, and bypass modules. The subsampling modules shorten the sequence, the upsampling modules restore its original length, and the bypass modules improve convergence. Compared with LLaMA, SUBLLM achieves significant gains in both training and inference speed as well as memory usage, while maintaining competitive few-shot performance. During training, SUBLLM speeds up computation by 26% and reduces memory by 10 GB per GPU. During inference, it accelerates computation by up to 37% and reduces memory by 1 GB per GPU. When the context window is expanded to 8192, training and inference speeds improve by 34% and 52%, respectively. Our code is available at https://github.com/XiaoMi/subllm.
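To illustrate the data flow the abstract describes, the following is a minimal NumPy sketch of the subsample → process → upsample → bypass pattern. It is not the paper's actual implementation: the stride-based subsampling, nearest-neighbor upsampling, and the fixed blend weight are all simplifying assumptions made here for illustration.

```python
import numpy as np

def subsample(x, ratio=2):
    # Shorten the sequence by keeping every `ratio`-th token
    # (stand-in for the paper's learned subsampling module).
    return x[::ratio]

def upsample(y, target_len, ratio=2):
    # Restore the original sequence length by repeating each
    # retained token, then trimming to `target_len`.
    return np.repeat(y, ratio, axis=0)[:target_len]

def bypass(x, y, w=0.5):
    # Blend the pre-subsampling activations `x` with the upsampled
    # activations `y`; `w` is a fixed weight here, learned in practice.
    return w * x + (1 - w) * y

# Toy activations: (seq_len=7, hidden=4).
x = np.random.randn(7, 4)
s = subsample(x)            # shortened to 4 tokens
u = upsample(s, len(x))     # restored to 7 tokens
out = bypass(x, u)          # same shape as the input, (7, 4)
```

Since the expensive decoder layers would operate only on the shortened sequence `s`, the cost of those layers scales with the subsampled length rather than the full context, which is the intuition behind the reported speed and memory gains.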