Effective code retrieval plays a crucial role in advancing code generation, bug fixing, and software maintenance, particularly as software systems increase in complexity. While current code embedding models have demonstrated promise in retrieving code snippets for small-scale, well-defined tasks, they often underperform in more demanding real-world applications such as bug localization within GitHub repositories. We hypothesize that a key issue is their reliance on noisy and inconsistent datasets for training, which impedes their ability to generalize to more complex retrieval scenarios. To address these limitations, we introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages. This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives, thereby facilitating more effective learning. We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks. Furthermore, the dataset can be leveraged for training code reranking models, a largely underexplored area compared to text reranking. Our finetuned code reranking model significantly improves the ranking quality over the retrieved results. Finally, by employing our code retriever and reranker together, we demonstrate significant improvements in function localization for GitHub issues, an important component of real-world software development.
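The contrastive objective described above can be illustrated with a minimal InfoNCE-style loss, in which each (query, positive) pair is contrasted against the other in-batch positives plus a set of mined hard negatives. This is an illustrative sketch in NumPy, not the paper's exact training code; the function name, temperature value, and batch shapes are assumptions for the example.

```python
import numpy as np

def infonce_loss(query_embs, pos_embs, hard_neg_embs, temperature=0.05):
    """Illustrative InfoNCE contrastive loss for code retrieval training.

    query_embs:    (B, d) embeddings of natural-language/code queries
    pos_embs:      (B, d) embeddings of the matched (positive) code snippets
    hard_neg_embs: (B, d) embeddings of mined hard-negative snippets
    Each query is scored against all B positives (in-batch negatives)
    plus the hard negatives; the correct answer is its own positive.
    """
    # L2-normalize so dot products are cosine similarities
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pos_embs / np.linalg.norm(pos_embs, axis=1, keepdims=True)
    n = hard_neg_embs / np.linalg.norm(hard_neg_embs, axis=1, keepdims=True)

    # (B, B) query-positive similarities and (B, B) query-hard-negative
    # similarities, concatenated into one candidate pool per query
    logits = np.concatenate([q @ p.T, q @ n.T], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability

    # Cross-entropy with the diagonal (query i's own positive) as the label
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return float(-log_probs[idx, idx].mean())
```

In practice the hard negatives would be the snippets mined by CoRNStack's curation pipeline: candidates that score highly against the query but are filtered out as incorrect matches, which makes the softmax distinguish genuinely relevant code from superficially similar code.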