Large Language Models (LLMs) have proven highly effective at automating software engineering tasks, bridging natural language and code semantics to achieve notable results in code generation and summarization. However, their scale incurs substantial computational costs, making full fine-tuning impractical. Parameter-Efficient Fine-Tuning (PEFT) methods such as QLoRA enable efficient specialization with lower resource demands. Recent studies show that QLoRA-optimized Large Code Models (LCMs) perform strongly across diverse tasks, yet it remains unclear whether this effectiveness persists when a single model is QLoRA fine-tuned for multiple code-related tasks. The interaction between Multi-task fine-tuning and QLoRA optimization, and how transfer learning affects the correctness and quality of generated artifacts, remain largely unexplored. We investigate Multi-task QLoRA fine-tuning across three representative tasks: code generation, translation, and summarization. We evaluate functional correctness through execution-based and similarity-based metrics, complemented by a comprehensive code quality analysis, an aspect largely overlooked in prior work. Our findings show that Multi-task QLoRA effectively leverages transfer learning, achieving competitive or superior performance relative to both Single-task QLoRA and Multi-task full fine-tuning. Larger models demonstrate a more consistent balance between correctness and quality, whereas smaller models preserve functionality but exhibit a higher incidence of quality-related issues.
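The abstract contrasts QLoRA's parameter efficiency with the cost of full fine-tuning. As a rough illustration of why PEFT methods lower resource demands, the sketch below compares trainable-parameter counts for a single linear layer under full fine-tuning versus a rank-r LoRA adapter (the update ΔW = BA with B of shape d_out×r and A of shape r×d_in). The hidden size and rank are hypothetical assumptions for illustration, not the configuration used in the paper.

```python
# Hedged sketch: trainable-parameter counts for LoRA vs. full fine-tuning.
# All dimensions below are illustrative assumptions, not the paper's models.

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable params a LoRA adapter adds to one linear layer:
    B (d_out x r) plus A (r x d_in); the frozen base weight W is not updated."""
    return d_out * r + r * d_in

d_in = d_out = 4096          # hypothetical hidden size of a 7B-scale layer
full_ft = d_in * d_out       # params updated if the layer is fully fine-tuned
lora = lora_param_count(d_in, d_out, r=16)

print(f"full fine-tuning: {full_ft:,} trainable params")
print(f"LoRA (r=16):      {lora:,} trainable params "
      f"({lora / full_ft:.2%} of full)")
```

QLoRA additionally stores the frozen base weights in 4-bit quantized form (NF4), which is what drives the memory reduction beyond the trainable-parameter savings shown here.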