Code search plays a crucial role in software development, enabling developers to retrieve and reuse code through natural language queries. While the performance of code search models improves with more high-quality training data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made remarkable progress in understanding and generating both natural and programming languages, offering user-friendly interaction via simple prompts. Inspired by these advancements, we propose ChatDANCE, a novel approach that utilizes high-quality and diverse augmented data generated by a large language model and leverages a filtering mechanism to eliminate low-quality augmentations. Specifically, we first design a set of ChatGPT prompting rules tailored to source code and queries. We then use ChatGPT to rewrite code and queries based on the corresponding prompts, and propose a filtering mechanism that trains a cross-encoder from the backbone model UniXcoder to filter out code-query pairs with low matching scores. Finally, we re-train the backbone model on the resulting high-quality augmented data. Experimental results show that ChatDANCE achieves state-of-the-art performance, improving over the best baseline by 13.2% (R@1) and 7% (MRR). Surprisingly, we find that this augment-filter-retrain strategy enables the backbone model (UniXcoder) to self-grow. Moreover, extensive experiments demonstrate the effectiveness of each component and show that ChatDANCE performs stably under different hyperparameter settings. In addition, we conduct qualitative and quantitative analyses to investigate why ChatDANCE works well and find that it learns a more uniform distribution of representations and effectively aligns the code and query spaces.
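The augment-filter-retrain strategy described above can be sketched in a few lines. The snippet below is a minimal, self-contained illustration only: the `rewrite` and `match_score` functions are hypothetical stand-ins for components the abstract attributes to ChatGPT (prompt-based rewriting) and to a UniXcoder-based cross-encoder (matching scores), which are not reproduced here.

```python
import re

def rewrite(text: str, n_variants: int = 2) -> list[str]:
    """Hypothetical stand-in for LLM-based rewriting of a query or code snippet.

    In ChatDANCE this step is performed by ChatGPT using prompting rules
    designed for source code and queries.
    """
    return [f"{text} (variant {i})" for i in range(1, n_variants + 1)]

def match_score(query: str, code: str) -> float:
    """Hypothetical stand-in for the cross-encoder matching score in [0, 1].

    Here approximated by token-set (Jaccard) overlap; the paper instead
    trains a cross-encoder from the backbone model UniXcoder.
    """
    q_tokens = set(re.findall(r"[a-z]+", query.lower()))
    c_tokens = set(re.findall(r"[a-z]+", code.lower()))
    return len(q_tokens & c_tokens) / max(len(q_tokens | c_tokens), 1)

def augment_and_filter(pairs, threshold=0.15):
    """Augment (query, code) pairs, then keep only candidates that score
    at or above the filtering threshold."""
    kept = []
    for query, code in pairs:
        candidates = [(q, code) for q in rewrite(query)]   # rewritten queries
        candidates += [(query, c) for c in rewrite(code)]  # rewritten code
        kept.extend(p for p in candidates if match_score(*p) >= threshold)
    return kept

pairs = [("sort a list in python", "def sort_list(xs): return sorted(xs)")]
augmented = augment_and_filter(pairs)
# The surviving pairs would then be used to re-train the retrieval model.
```

Under this strategy, the backbone model plays two roles: it provides the cross-encoder that filters the augmentations, and it is then re-trained on the filtered data, which is the self-growth loop the abstract highlights.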