This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features and by using both similar and dissimilar examples during training. We present the first code search method that encodes dynamic runtime information during training without needing to execute either the corpus under search or the search query at inference time, and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and a single negative reference sample in the training process yield substantial performance improvements, demonstrating that both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced, well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine-tuning, even when enhancing the largest available LLMs, highlighting the importance of open-sourced models. To ensure the reproducibility and extensibility of our research, we present an open-sourced implementation of our tool and training procedures called Cosco.
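As a rough illustration of what training on one positive and one negative reference sample can look like, the sketch below uses a generic triplet-style margin loss over embedding vectors, followed by a similarity ranking that needs no code execution at query time. The abstract does not state the actual objective, so this should not be read as Cosco's implementation. The function names (`cosine`, `triplet_loss`, `search`), the margin value, and the toy vectors are all hypothetical. It is also an assumption here that the positive and negative labels would come from dynamic (runtime) behaviour observed during training.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hypothetical objective: zero once the anchor is at least `margin`
    more similar to the positive sample than to the negative one."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def search(query_vec, corpus):
    """Rank corpus entries by similarity to the query embedding.
    Only vectors are compared; nothing is executed at search time."""
    return sorted(corpus, key=lambda name: cosine(query_vec, corpus[name]), reverse=True)


if __name__ == "__main__":
    # Toy embeddings (made up): a Java query and two Python candidates.
    query = [1.0, 0.0]
    corpus = {"sort_desc.py": [0.0, 1.0], "sort_asc.py": [1.0, 0.1]}
    print(search(query, corpus))
    print(triplet_loss(query, corpus["sort_asc.py"], corpus["sort_desc.py"]))
```

In a real setup the vectors would come from the fine-tuned model's encoder and the loss would be minimized by gradient descent. The sketch shows only the shape of the objective and of the inference-time ranking.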