The increasing size of language models has generated considerable research interest in parameter-efficient fine-tuning methods such as LoRA, which freezes the pre-trained model and injects a small number of trainable parameters for each downstream task (e.g., summarization, question answering, and translation). To further improve the efficiency of fine-tuning, we propose a framework that integrates LoRA with structured layer pruning. We validate the integrated framework on two de-identified medical report summarization datasets that we created from MIMIC-IV-Note and on two public medical dialogue datasets. By tuning 0.6% of the original model's parameters and pruning over 30% of its Transformer layers, our framework reduces GPU memory usage by 50% and speeds up the training phase by 100%, while preserving over 92% of generation quality on free-text sequence-to-sequence tasks.
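To make the parameter accounting behind combining LoRA with layer pruning concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes a toy Transformer stack in which every layer holds four d×d attention projections and a two-matrix feed-forward block with hidden size 4d (12·d² weights per layer, ignoring embeddings, biases, and layer norms), and it assumes LoRA adds two rank-r factors to each adapted d×d matrix. The function name `lora_param_counts` and all of its arguments are hypothetical.

```python
def lora_param_counts(d_model, n_layers, rank, adapted_per_layer=2, kept_layers=None):
    """Count weights in a toy Transformer stack with LoRA and layer pruning.

    Assumptions (illustrative only): each layer has 12 * d_model**2 frozen
    weights, and each LoRA-adapted d x d matrix gets two trainable factors
    of shapes (rank x d_model) and (d_model x rank).
    """
    if kept_layers is None:
        kept_layers = n_layers  # no pruning
    if not 0 < kept_layers <= n_layers:
        raise ValueError("kept_layers must be in the range 1..n_layers")

    per_layer = 12 * d_model * d_model           # frozen weights in one layer
    lora_per_matrix = 2 * rank * d_model         # A (r x d) plus B (d x r)

    original = n_layers * per_layer              # unpruned, fully frozen model
    frozen_after_pruning = kept_layers * per_layer
    trainable = kept_layers * adapted_per_layer * lora_per_matrix

    return {
        "original": original,
        "frozen_after_pruning": frozen_after_pruning,
        "trainable": trainable,
        # Trainable weights relative to the original (unpruned) model size.
        "trainable_fraction": trainable / original,
    }


if __name__ == "__main__":
    # Hypothetical dimensions chosen for illustration, not taken from the paper.
    counts = lora_param_counts(d_model=1024, n_layers=24, rank=8, kept_layers=16)
    for name, value in counts.items():
        print(name, value)
```

Under these assumptions, pruning whole layers shrinks both the frozen weights that must be held in memory and the number of LoRA factors that are trained, which is the intuition behind pairing the two techniques.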