Large language models (LLMs) are computationally intensive. Their computation workload and memory footprint grow quadratically with the dimension (layer width). Most LLM parameters come from the linear layers of the transformer structure and are highly redundant. These linear layers contribute more than 80% of the computation workload and 99% of the model size. To pretrain and finetune LLMs efficiently, three major challenges must be addressed: 1) reducing the redundancy of the linear layers; 2) reducing the GPU memory footprint; 3) improving GPU utilization in distributed training. Prior methods, such as LoRA and QLoRA, use low-rank matrices and quantization to reduce the number of trainable parameters and the model size, respectively. However, the resulting model still consumes a large amount of GPU memory. In this paper, we present high-performance GPU-based methods that exploit low-rank structures to pretrain and finetune LLMs for financial applications. We replace one conventional linear layer of the transformer structure with two narrower linear layers, which allows us to reduce the number of parameters by several orders of magnitude. By quantizing the parameters into low precision (8-bit and 4-bit), we further reduce the memory consumption of the resulting model. Compared with existing LLMs, our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretraining without an accuracy drop. For finetuning, our methods achieve average accuracy increases of 6.3% and 24.0% on general tasks and financial tasks, respectively, and a GPU memory consumption ratio of 6.3X. Our models are smaller than 0.59 GB, allowing inference on a smartphone.
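The replacement of one linear layer by two narrower ones can be illustrated with a short parameter count. The following is a minimal sketch, not the authors' implementation: the layer width `d = 4096`, the rank `r = 64`, and all function names are assumptions chosen for illustration, and the resulting ratios are not the figures reported in the abstract.

```python
# Illustrative sketch only: sizes, rank, and helper names are assumptions,
# not values or code taken from the paper.

def dense_params(d_in, d_out):
    """Weights in one conventional linear layer (bias omitted)."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Weights in two narrower layers: d_in -> rank, then rank -> d_out."""
    return d_in * rank + rank * d_out

def size_bytes(num_params, bits):
    """Storage needed when each weight is kept at the given bit width."""
    return num_params * bits // 8

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def low_rank_forward(A, B, x):
    """Apply the two narrow layers in sequence: y = B (A x).

    A has shape rank x d_in, B has shape d_out x rank.
    """
    return matvec(B, matvec(A, x))

d, r = 4096, 64                      # hypothetical layer width and rank
dense = dense_params(d, d)           # 16,777,216 weights
low = low_rank_params(d, d, r)       # 524,288 weights
ratio = dense / low                  # 32.0

dense_fp16 = size_bytes(dense, 16)   # 33,554,432 bytes
low_int8 = size_bytes(low, 8)        # 524,288 bytes
low_int4 = size_bytes(low, 4)        # 262,144 bytes

# Tiny forward pass: d_in = 3, rank = 2, d_out = 4.
A = [[1, 0, 0], [0, 1, 0]]
B = [[1, 1], [0, 1], [1, 0], [2, 0]]
y = low_rank_forward(A, B, [1, 2, 3])  # [3, 2, 1, 2]
```

In this sketch the saving comes from the rank being much smaller than the layer width: the dense layer costs `d_in * d_out` weights, while the pair costs `rank * (d_in + d_out)`. Storing the remaining weights at 8 or 4 bits then shrinks the footprint further, which is the combination the abstract describes.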