Semantic code search, the task of retrieving code that matches a given natural language query, is important for improving productivity in software engineering. Existing code search datasets are limited by their reliance on human annotators, who assess code primarily through semantic understanding rather than functional verification; this can introduce inaccurate labels and makes annotation hard to scale. In addition, current evaluation metrics often overlook the multi-choice nature of code search, in which one query may have several suitable answers. This paper introduces CoSQA+, which pairs high-quality queries from CoSQA with multiple suitable code snippets. We develop an automated pipeline that combines candidate selection by multiple models with a novel test-driven agent annotation system. Compared with a single Large Language Model (LLM) annotator and with Python expert annotators (without test-based verification), the agents leverage test-based verification and achieve the highest accuracy, 93.9%. Extensive experiments show that CoSQA+ is of higher quality than CoSQA and that models trained on CoSQA+ achieve improved performance. We publicly release both CoSQA+_all, which contains 412,080 agent-annotated pairs, and CoSQA+_verified, which contains 1,000 human-verified pairs, at https://github.com/DeepSoftwareAnalytics/CoSQA_Plus.
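To illustrate the difference between judging code by reading it and judging it by functional verification, the following is a minimal sketch of a test-based check. It is not the paper's agent system: the helper `passes_tests`, the example query, the candidate snippets, and the hand-written test cases are all hypothetical, and a real pipeline would presumably generate tests automatically and run candidates in a sandbox.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical test case format: (positional arguments, expected return value).
TestCase = Tuple[Tuple[Any, ...], Any]


def passes_tests(code: str, func_name: str, tests: List[TestCase]) -> bool:
    """Return True only if `code` defines `func_name` and it passes every test.

    Illustrative only: exec() on untrusted code is unsafe. A real pipeline
    would isolate candidates in a sandbox and enforce time limits.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(code, namespace)  # define the candidate function
        func: Callable[..., Any] = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing names, and runtime failures count as "no match".
        return False


# Hypothetical query: "python reverse a string"
tests: List[TestCase] = [(("abc",), "cba"), (("",), "")]

good = "def reverse(s):\n    return s[::-1]\n"
bad = "def reverse(s):\n    return s.upper()\n"

print(passes_tests(good, "reverse", tests))  # True
print(passes_tests(bad, "reverse", tests))   # False
```

Both candidates mention the right function name and look plausible at a glance; only executing them against tests separates the one that actually satisfies the query.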
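The remark that evaluation metrics can overlook the multi-choice nature of code search can be made concrete with a small sketch. The abstract does not name specific metrics, so the choice here is an assumption: reciprocal rank stands in for a single-answer metric and average precision for a multi-answer one, both in their standard information-retrieval definitions. The code IDs and rankings are made up for illustration.

```python
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item; ignores all later relevant items."""
    for rank, code_id in enumerate(ranked, start=1):
        if code_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which relevant items appear."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, code_id in enumerate(ranked, start=1):
        if code_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# One query with two suitable code snippets, "a" and "b" (hypothetical IDs).
relevant = {"a", "b"}
ranking_1 = ["a", "x", "y", "b"]  # second correct answer is buried
ranking_2 = ["a", "b", "x", "y"]  # both correct answers on top

print(reciprocal_rank(ranking_1, relevant), reciprocal_rank(ranking_2, relevant))
# 1.0 1.0
print(average_precision(ranking_1, relevant), average_precision(ranking_2, relevant))
# 0.75 1.0
```

Reciprocal rank scores the two rankings identically because it stops at the first hit, while average precision rewards the ranking that surfaces every suitable snippet early.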