Pre-trained code models have emerged as crucial tools for a variety of code intelligence tasks. However, their effectiveness depends on the quality of the pre-training dataset, particularly the human reference comments, which serve as a bridge between programming language and natural language. One significant challenge is that such comments can become inconsistent with the corresponding code as the software evolves. This discrepancy can lead to suboptimal training and degrade model performance. Large language models (LLMs) have demonstrated superior capabilities in generating high-quality code comments. In light of this, we tackle the dataset quality issue by harnessing the power of LLMs. Specifically, we raise the question: can we rebuild the pre-training dataset by substituting the original comments with LLM-generated ones to obtain more effective pre-trained code models? To answer this question, we first conduct a comprehensive evaluation comparing ChatGPT-generated comments with human reference comments. Because existing reference-based metrics treat the reference comments as gold standards, we introduce two auxiliary tasks as novel reference-free metrics for assessing comment quality: code-comment inconsistency detection and code search. Experimental results show that ChatGPT-generated comments are more semantically consistent with the code than human references, indicating the potential of using ChatGPT to enhance the quality of the pre-training dataset. We then rebuild the widely used CodeSearchNet dataset with ChatGPT-generated comments and re-pre-train CodeT5 on the refined dataset. Evaluation results on four generation tasks and one understanding task show that the model pre-trained on ChatGPT-enhanced data outperforms its counterpart on code summarization, code generation, and code translation.
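To illustrate how code search can act as a reference-free metric, the following is a minimal sketch, not the authors' implementation. It assumes that a comment is "consistent" to the degree that it retrieves its own code snippet from a pool of candidates, and it scores this with mean reciprocal rank (MRR). The bag-of-words cosine similarity is a stand-in assumption for whatever trained retrieval model is actually used, and the function names and toy data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic and numeric tokens."""
    return re.findall(r"[a-z]+|\d+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_reciprocal_rank(comments, codes):
    """Use each comment as a query over all snippets; comments[i] belongs to codes[i].

    Assumption: a toy lexical retriever replaces a learned code search model.
    """
    code_vecs = [Counter(tokenize(c)) for c in codes]
    total = 0.0
    for i, comment in enumerate(comments):
        query = Counter(tokenize(comment))
        scores = [cosine(query, vec) for vec in code_vecs]
        # Rank of the paired snippet: 1 + number of snippets scored strictly higher.
        rank = 1 + sum(1 for s in scores if s > scores[i])
        total += 1.0 / rank
    return total / len(comments)


# Hypothetical toy data: two snippets with matching and mismatched comments.
codes = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
]
consistent_comments = ["add two numbers a and b", "read the file at path"]
stale_comments = ["read the file at path", "add two numbers a and b"]

print("MRR (consistent):", mean_reciprocal_rank(consistent_comments, codes))
print("MRR (stale):", mean_reciprocal_rank(stale_comments, codes))
```

Under this sketch, a set of comments that tracks the code more closely yields a higher MRR without consulting any human reference, which is the property a reference-free metric needs.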