(Source) code search has attracted wide attention from software engineering researchers because it can improve the productivity and quality of software development. Given a functionality requirement, usually described in a natural language sentence, a code search system retrieves code snippets that satisfy the requirement from a large-scale code corpus such as GitHub. Many techniques have been proposed to make code search effective and efficient. These techniques improve code search performance mainly by optimizing three core components: the query understanding component, the code understanding component, and the query-code matching component. In this paper, we survey code search from a 3-dimensional perspective. Specifically, we categorize existing code search studies into query-end optimization techniques, code-end optimization techniques, and match-end optimization techniques, according to the component each study optimizes. Because each end can be optimized independently and contributes to code search performance, we treat each end as a dimension. The survey is therefore 3-dimensional in nature, and it summarizes each dimension in detail. To understand the research trends along the three dimensions, we systematically review 68 relevant publications. Existing code search surveys either focus only on the query end or the code end, or cover various aspects (including codebase, evaluation metrics, modeling technique, etc.) only superficially. In contrast, our survey provides a more nuanced analysis and review of the evolution and development of the underlying techniques used at the three ends. Based on this systematic review and summary of existing work, we outline several open challenges and opportunities at the three ends that remain to be addressed in future work.
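As a rough illustration of the three ends named in the abstract, the following minimal sketch wires a query understanding step, a code understanding step, and a query-code matching step into one toy search function. It is not taken from the survey or from any surveyed technique: the function names, the `SYNONYMS` table, and the choice of bag-of-words cosine similarity are all assumptions made for this example, and real systems would typically replace each step with far stronger models.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical synonym table, used only for this illustration.
SYNONYMS = {"read": ["load"], "file": ["path"]}


def understand_query(query: str) -> Counter:
    """Query end: lowercase, tokenize, and expand tokens with synonyms."""
    tokens = re.findall(r"[a-z]+", query.lower())
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, []))
    return Counter(expanded)


def understand_code(snippet: str) -> Counter:
    """Code end: split identifiers on snake_case and camelCase boundaries."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", snippet):
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        tokens.extend(part.lower() for part in parts)
    return Counter(tokens)


def match(query_vec: Counter, code_vec: Counter) -> float:
    """Match end: cosine similarity between two bag-of-words vectors."""
    dot = sum(query_vec[t] * code_vec[t] for t in query_vec)
    norm = sqrt(sum(v * v for v in query_vec.values())) * sqrt(
        sum(v * v for v in code_vec.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, corpus: list[str], top_k: int = 1) -> list[str]:
    """Rank every snippet in the corpus against the query; return the top_k."""
    query_vec = understand_query(query)
    scored = [(match(query_vec, understand_code(s)), s) for s in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:top_k]]


corpus = [
    "def readFile(path): return open(path).read()",
    "def add_numbers(a, b): return a + b",
]
top_hit = search("read a file", corpus)[0]
```

Because the three functions only communicate through token-count vectors, any one of them can be swapped out without touching the others, which mirrors the abstract's point that each end can be optimized independently.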