Modern instruction-tuned large language models (LLMs) have made remarkable progress in code generation. However, LLMs fine-tuned with standard supervised fine-tuning (SFT) sometimes generate plausible-looking but functionally incorrect code. This issue likely stems from a limitation of standard SFT, which treats all tokens equally during optimization and fails to emphasize error-sensitive segments: the specific code differences between correct implementations and similar-yet-incorrect variants. To address this problem, we propose Fault-Guided Fine-Tuning (FGIT), a novel fine-tuning technique that enhances LLMs' code generation by (1) extracting multi-granularity (line- and token-level) differences between correct and similar-yet-incorrect implementations to identify error-sensitive segments, and (2) prioritizing those segments during training via dynamic loss weighting. Through extensive experiments on seven LLMs across three widely used benchmarks, our method achieves an average relative improvement of 6.9% in pass@1, with some enhanced 6.7B LLMs outperforming closed-source models such as GPT-3.5-Turbo. Furthermore, our fine-tuning technique demonstrates strong generalization, with performance improvements ranging from 3.8% to 19.1% across diverse instruction-tuned LLMs, and our ablation studies confirm the contributions of the different difference granularities and of the hyperparameter choices.
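The two-step mechanism described above can be sketched as follows. This is a minimal illustration only, assuming a PyTorch training setup: the diff extraction uses Python's `difflib` at line granularity, and the function names, the per-token mask, and the weight hyperparameter `alpha` are hypothetical choices for exposition, not the paper's actual implementation.

```python
import difflib

import torch
import torch.nn.functional as F


def error_sensitive_line_mask(correct: str, incorrect: str) -> list:
    """Step 1 (line granularity, illustrative): mark lines of the correct
    implementation that differ from a similar-yet-incorrect variant."""
    correct_lines = correct.splitlines()
    matcher = difflib.SequenceMatcher(None, correct_lines, incorrect.splitlines())
    mask = [True] * len(correct_lines)  # assume every line is sensitive...
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = False  # ...then clear lines shared with the incorrect variant
    return mask


def fault_guided_loss(logits, targets, sensitive_mask, alpha=2.0):
    """Step 2 (illustrative): weighted cross-entropy in which tokens inside
    error-sensitive segments (mask == 1) receive weight alpha, others 1.

    logits: (seq_len, vocab_size); targets, sensitive_mask: (seq_len,)
    alpha is a hypothetical up-weighting hyperparameter (> 1).
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = 1.0 + (alpha - 1.0) * sensitive_mask.float()
    return (weights * per_token).sum() / weights.sum()
```

Here only the three-line diff-and-weight pipeline is shown; in practice the line-level mask would be projected onto the token sequence produced by the model's tokenizer, and a token-level diff could refine it further.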