Code search plays a crucial role in software development, enabling developers to retrieve and reuse code through natural language queries. While the performance of code search models improves with more high-quality training data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made remarkable progress in understanding and generating both natural and programming languages, offering user-friendly interaction via simple prompts. Inspired by these advancements, we propose ChatDANCE, a novel approach that utilizes high-quality and diverse augmented data generated by a large language model and leverages a filtering mechanism to eliminate low-quality augmentations. Specifically, we first design a set of ChatGPT prompting rules tailored to source code and queries. We then use ChatGPT to rewrite code and queries based on the corresponding prompts, and propose a filtering mechanism that trains a cross-encoder from the backbone model UniXcoder to filter out code-query pairs with low matching scores. Finally, we re-train the backbone model on the resulting high-quality augmented data. Experimental results show that ChatDANCE achieves state-of-the-art performance, improving over the best baseline by 13.2% (R@1) and 7% (MRR). Surprisingly, we find that this augment-filter-retrain strategy enables the backbone model (UniXcoder) to self-grow. Moreover, extensive experiments demonstrate the effectiveness of each component and show that ChatDANCE performs stably under different hyperparameter settings. In addition, we conduct qualitative and quantitative analyses to investigate why ChatDANCE works well and find that it learns a more uniform distribution of representations and effectively aligns the code and query spaces.
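The augment-filter-retrain strategy described above can be sketched in a few lines. The snippet below is a minimal, self-contained illustration only: the `rewrite` and `match_score` functions are hypothetical stand-ins for components the abstract attributes to ChatGPT (prompt-based rewriting) and to a UniXcoder-based cross-encoder (matching scores), which are not reproduced here.

```python
import re

def rewrite(text: str, n_variants: int = 2) -> list[str]:
    """Hypothetical stand-in for LLM-based rewriting of a query or code snippet.

    In ChatDANCE this step is performed by ChatGPT using prompting rules
    designed for source code and queries.
    """
    return [f"{text} (variant {i})" for i in range(1, n_variants + 1)]

def match_score(query: str, code: str) -> float:
    """Hypothetical stand-in for the cross-encoder matching score in [0, 1].

    Here approximated by token-set (Jaccard) overlap; the paper instead
    trains a cross-encoder from the backbone model UniXcoder.
    """
    q_tokens = set(re.findall(r"[a-z]+", query.lower()))
    c_tokens = set(re.findall(r"[a-z]+", code.lower()))
    return len(q_tokens & c_tokens) / max(len(q_tokens | c_tokens), 1)

def augment_and_filter(pairs, threshold=0.15):
    """Augment (query, code) pairs, then keep only candidates that score
    at or above the filtering threshold."""
    kept = []
    for query, code in pairs:
        candidates = [(q, code) for q in rewrite(query)]   # rewritten queries
        candidates += [(query, c) for c in rewrite(code)]  # rewritten code
        kept.extend(p for p in candidates if match_score(*p) >= threshold)
    return kept

pairs = [("sort a list in python", "def sort_list(xs): return sorted(xs)")]
augmented = augment_and_filter(pairs)
# The surviving pairs would then be used to re-train the retrieval model.
```

Under this strategy, the backbone model plays two roles: it provides the cross-encoder that filters the augmentations, and it is then re-trained on the filtered data, which is the self-growth loop the abstract highlights.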