Programming skill is a crucial ability for Large Language Models (LLMs), and it requires a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine how pre-training data affects the performance of code-focused LLMs, using comment density as a measure of PL-NL alignment. Because code-comment aligned data is scarce in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that removes code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent performance improvements on two widely used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used to generate the comments and the model further trained on the data without augmentation.
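As an illustration of the comment-density idea, the following is a minimal sketch and not the procedure used in this work. It assumes that density is the fraction of characters in a file that belong to comments, that only Python `#` comments count (docstrings are ignored), and that filtering keeps samples at or above a threshold. The function names `comment_density` and `filter_by_density` and the default threshold `0.1` are hypothetical.

```python
import io
import tokenize


def comment_density(source: str) -> float:
    """Return the fraction of characters in `source` that are `#` comments.

    Assumption: density is character-based and docstrings are not counted.
    """
    if not source:
        return 0.0
    comment_chars = 0
    try:
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        for tok in tokens:
            if tok.type == tokenize.COMMENT:
                comment_chars += len(tok.string)
    except (tokenize.TokenError, SyntaxError):
        # Code that cannot be tokenized is treated as having no comments.
        return 0.0
    return comment_chars / len(source)


def filter_by_density(samples, min_density=0.1):
    """Keep samples whose comment density is at least `min_density`.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return [s for s in samples if comment_density(s) >= min_density]


if __name__ == "__main__":
    corpus = ["x = 1\n", "x = 1  # set x\n"]
    for sample in corpus:
        print(repr(sample), round(comment_density(sample), 3))
    print(filter_by_density(corpus))
```

Under these assumptions, a file with no comments scores 0 and is dropped by the filter, while a file whose comments make up a larger share of its characters is kept.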