Natural-language-to-code generation is an important application area of LLMs and has received wide attention from the community. Most relevant studies have concentrated exclusively on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered considerable interest, and multiple works have showcased its importance for improving performance. In this work, we investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs by (1) renaming variables, (2) modularizing and decomposing complex code into smaller helper sub-functions, and (3) inserting natural-language-based plans via LLM-based transformations. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on our transformed modularized programs improves performance by up to 30% compared to fine-tuning on the original dataset. Additionally, we demonstrate improved performance from using a smaller amount of higher-quality data, finding that a model fine-tuned on the entire original dataset is outperformed by a model trained on 15% of our cleaned dataset. Even in comparison to closed-source models, our models outperform the much larger AlphaCoder models.
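To make the three transformations concrete, the following is a minimal hand-written sketch of what a program might look like before and after cleaning. The task (maximum sum of a fixed-size window) and every name in it (`f`, `max_window_sum`, `initial_window_sum`, `slide_window`, `behaves_the_same`) are illustrative assumptions, not programs or helpers taken from the datasets; in the pipeline described above the rewriting is performed by an LLM rather than by hand. The final equivalence check is likewise only an assumed sanity check that a rewrite preserves behavior.

```python
from typing import List, Sequence, Tuple


# "Before": terse names, all logic in one function, no stated plan.
# Assumes k >= 1.
def f(a: List[int], k: int) -> int:
    s = 0
    b = 0
    for i in range(len(a)):
        s += a[i]
        if i >= k:
            s -= a[i - k]
        if i >= k - 1:
            b = max(b, s) if i > k - 1 else s
    return b


# "After": (1) descriptive names, (2) small helper functions,
# (3) a natural-language plan at the top of the main function.
def initial_window_sum(values: List[int], window_size: int) -> int:
    """Return the sum of the first `window_size` elements."""
    return sum(values[:window_size])


def slide_window(current_sum: int, entering: int, leaving: int) -> int:
    """Update the window sum when the window moves one step to the right."""
    return current_sum + entering - leaving


def max_window_sum(values: List[int], window_size: int) -> int:
    # Plan:
    # 1. Compute the sum of the first window.
    # 2. Slide the window one element at a time, updating the sum.
    # 3. Track and return the largest sum seen.
    if len(values) < window_size:
        return 0  # no complete window exists (matches the original)
    current_sum = initial_window_sum(values, window_size)
    best_sum = current_sum
    for right in range(window_size, len(values)):
        leaving = values[right - window_size]
        current_sum = slide_window(current_sum, values[right], leaving)
        best_sum = max(best_sum, current_sum)
    return best_sum


def behaves_the_same(cases: Sequence[Tuple[List[int], int]]) -> bool:
    """Assumed sanity check: both versions agree on every test input."""
    return all(f(v, k) == max_window_sum(v, k) for v, k in cases)
```

Both versions compute the same function; only naming, decomposition, and the explicit plan differ, which is the stylistic dimension the abstract argues matters for fine-tuning data.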