This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features and by using both similar and dissimilar examples during training. We present the first code search method that encodes dynamic runtime information during training without needing to execute either the corpus under search or the search query at inference time, and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and a single negative reference sample in the training process yield substantial performance improvements, demonstrating that both similar and dissimilar references are important parts of code search. Importantly, we show that well-crafted, fine-tuned enhanced models consistently outperform larger modern LLMs that are enhanced but not fine-tuned, even when the largest available LLMs are enhanced, highlighting the importance of open-source models. To ensure the reproducibility and extensibility of our research, we present an open-source implementation of our tool and training procedures, called REINFOREST.
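To make the idea of training on one similar and one dissimilar reference more concrete, the following is a minimal sketch and not the REINFOREST implementation. It assumes code snippets have already been mapped to embedding vectors, and it uses a margin-based triplet objective over cosine similarity purely as an illustrative stand-in; the function names `cosine`, `triplet_loss`, and `rank`, the `margin` value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_loss(query, positive, negative, margin=0.5):
    """Assumed illustrative objective: zero once the positive reference is
    at least `margin` more similar to the query than the negative reference."""
    return max(0.0, margin - cosine(query, positive) + cosine(query, negative))


def rank(query, corpus):
    """Return corpus keys ordered from most to least similar to the query.
    Only precomputed embeddings are compared; nothing is executed."""
    return sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)


if __name__ == "__main__":
    # Toy embeddings standing in for a query and candidates in another language.
    query = [1.0, 0.0]
    corpus = {"a": [0.0, 1.0], "b": [1.0, 0.1], "c": [-1.0, 0.0]}
    print(rank(query, corpus))
    print(triplet_loss(query, corpus["b"], corpus["c"]))
```

In this sketch, search at inference time reduces to comparing stored vectors, which mirrors the abstract's point that neither the corpus nor the query needs to be executed; how dynamic runtime information is encoded into those vectors during training is not specified here.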