The growing size of language models has generated strong research interest in parameter-efficient fine-tuning methods such as LoRA, which freezes the pre-trained model and injects a small number of trainable parameters for multiple downstream tasks (e.g., summarization, question answering, and translation). To further improve fine-tuning efficiency, we propose a framework that integrates LoRA with structured layer pruning. We validate the integrated framework on two de-identified medical report summarization datasets that we created from MIMIC-IV-Note and on two public medical dialogue datasets. By tuning 0.6% of the parameters of the original model and pruning over 30% of the Transformer layers, our framework reduces GPU memory usage by 50% and speeds up the training phase by 100%, while preserving over 92% of the generation quality on free-text sequence-to-sequence tasks.
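To make the two ingredients concrete, the following is a minimal sketch of the parameter arithmetic behind combining LoRA with layer pruning. It is not the framework's implementation: the function names, the rule of dropping the top-most layers, the LoRA rank, the choice of two adapted projections per layer, and the 12-layer, 768-dimensional model are all illustrative assumptions that the abstract does not specify.

```python
from typing import Dict, List


def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter.

    LoRA adds a low-rank update A @ B to a frozen d_in x d_out weight,
    with A of shape (d_in, rank) and B of shape (rank, d_out).
    """
    return rank * (d_in + d_out)


def prune_top_layers(num_layers: int, num_to_drop: int) -> List[int]:
    """Indices of the layers kept after removing whole layers.

    ASSUMPTION: the top-most layers are dropped. The abstract only says
    "structured layer pruning" and does not state the selection rule.
    """
    if not 0 <= num_to_drop < num_layers:
        raise ValueError("num_to_drop must be in [0, num_layers)")
    return list(range(num_layers - num_to_drop))


def summarize(num_layers: int, d_model: int, rank: int,
              adapted_per_layer: int, num_to_drop: int) -> Dict[str, float]:
    """Fraction of layers pruned and of parameters trained (hypothetical model)."""
    kept = prune_top_layers(num_layers, num_to_drop)
    # Rough frozen size of one Transformer layer: 4*d^2 for the attention
    # projections plus 8*d^2 for a feed-forward block with 4x expansion.
    # Biases, layer norms, and embeddings are ignored in this estimate.
    frozen_per_layer = 12 * d_model * d_model
    original = num_layers * frozen_per_layer
    # Only the adapters attached to the surviving layers are trained.
    trainable = len(kept) * adapted_per_layer * lora_adapter_params(
        d_model, d_model, rank)
    return {
        "pruned_fraction": num_to_drop / num_layers,
        "trainable_fraction": trainable / original,
    }


if __name__ == "__main__":
    stats = summarize(num_layers=12, d_model=768, rank=8,
                      adapted_per_layer=2, num_to_drop=4)
    print(stats)
```

With these illustrative settings, one third of the layers are removed and the trainable adapters amount to well under 1% of the original parameter count. The values are not meant to reproduce the reported 0.6% or 30% figures, which depend on the actual model and configuration.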