Large Language Models (LLMs) have proven highly effective in automating software engineering tasks, bridging natural language and code semantics to achieve notable results in code generation and summarization. However, their scale incurs substantial computational costs, making full fine-tuning impractical. Parameter-Efficient Fine-Tuning (PEFT) methods such as QLoRA enable efficient specialization with lower resource demands. Recent studies show that QLoRA-optimized Large Code Models (LCMs) perform strongly across diverse tasks, yet it remains unclear whether this effectiveness persists when a single model is QLoRA fine-tuned for multiple code-related tasks. The interaction between Multi-task fine-tuning and QLoRA optimization, and how transfer learning affects the correctness and quality of generated artifacts, remain largely unexplored. We investigate Multi-task QLoRA fine-tuning across three representative tasks: code generation, translation, and summarization. We evaluate functional correctness through execution-based and similarity-based metrics, complemented by a comprehensive code quality analysis, an aspect largely overlooked in prior work. Our findings show that Multi-task QLoRA effectively leverages transfer learning, achieving competitive or superior performance at the 1.5B, 3B, and 7B configurations relative to both Single-task QLoRA and Multi-task full fine-tuning. Larger models demonstrate a more consistent balance between correctness and quality, whereas smaller models preserve functionality but exhibit a higher incidence of quality-related issues.