Code Large Language Models (Code LLMs), such as Code Llama and DeepSeek-Coder, have demonstrated exceptional performance on code generation tasks. However, most existing models focus on generating correct code and often struggle with bug repair. We introduce a suite of methods to enhance LLMs' SQL bug-fixing abilities. The suite consists of two parts: Progressive Dataset Construction (PDC) from scratch and Dynamic Mask Supervised Fine-tuning (DM-SFT). PDC proposes two data expansion methods, from breadth-first and depth-first perspectives respectively. DM-SFT introduces an efficient supervised learning approach for bug fixing that reduces the total number of training steps and mitigates the "disorientation" problem in SQL bug-fixing training. In our evaluation, the code LLMs trained with these two methods outperform all current best-performing models of much larger size.