Code search is an important task that has seen many developments in recent years. However, previous work has mostly considered the problem of searching for code with a text query. We argue that using a code snippet (and possibly an associated traceback) as the query and looking for answers with bug-fixing instructions and code samples is a natural use case that is not covered by existing approaches. Moreover, existing datasets use comments extracted from code rather than full-text descriptions as their text, which makes them unsuitable for this use case. We present a new SearchBySnippet dataset that implements the search-by-code use case based on StackOverflow data; in this setting, existing architectures fall short of the simplest BM25 baseline even after fine-tuning. We also present SnippeR, a new single-encoder model that outperforms several strong baselines on the SearchBySnippet dataset with a Recall@10 of 0.451, and we propose the SearchBySnippet dataset and SnippeR as a new benchmark for code search evaluation.
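To make the evaluation setting concrete, here is a minimal sketch of a BM25 ranker scored with Recall@k, where the query is a snippet or error message and the documents are answer texts. It is not the paper's implementation: the whitespace tokenizer, the toy corpus, the parameter values `k1=1.5` and `b=0.75`, and the names `BM25` and `recall_at_k` are all illustrative assumptions. The sketch uses the non-negative (Lucene-style) IDF variant and assumes exactly one correct answer per query, which may differ from the definition used for the reported 0.451.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class BM25:
    """Okapi BM25 with the non-negative (Lucene-style) IDF variant."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        total = 0.0
        for t in tokenize(query):
            f = tf.get(t, 0)
            if f:
                total += self.idf[t] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query, k):
        scores = [self.score(query, i) for i in range(len(self.docs))]
        # Highest score first; ties keep the original document order.
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
        return ranked[:k]


def recall_at_k(retrieved, gold, k):
    # Assumes one correct answer per query: the fraction of queries
    # whose gold answer appears among the first k retrieved documents.
    hits = sum(1 for ranked, g in zip(retrieved, gold) if g in ranked[:k])
    return hits / len(gold)


# Toy corpus of answer texts and snippet-style queries (made up).
answers = [
    "convert the string with int before adding numbers",
    "install requests with pip inside your virtual environment",
    "check the list index is in range before access",
]
queries = [
    "typeerror cannot add int to string",
    "modulenotfounderror no module named requests",
    "indexerror list index out of range",
]
gold = [0, 1, 2]

index = BM25(answers)
retrieved = [index.top_k(q, 2) for q in queries]
recall = recall_at_k(retrieved, gold, 1)
```

A learned encoder would replace `BM25.score` with a similarity between query and answer embeddings, while `recall_at_k` would stay the same, so both kinds of system can be compared on one metric.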