Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from its more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.
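The abstract names Pairwise Preference Accuracy without defining it. As an illustration only, the sketch below assumes one plausible reading: the fraction of (higher-quality, lower-quality) snippet pairs for which a retriever assigns the higher-quality snippet a strictly higher relevance score. The function name `pairwise_preference_accuracy`, the tie-handling rule, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def pairwise_preference_accuracy(
    score_pairs: Sequence[Tuple[float, float]],
) -> float:
    """Fraction of pairs in which the higher-quality snippet wins.

    Each pair is (score_of_higher_quality, score_of_lower_quality) for the
    same query. ASSUMPTION: ties count as failures; the paper may define
    the metric differently.
    """
    if not score_pairs:
        raise ValueError("need at least one score pair")
    wins = sum(1 for good, bad in score_pairs if good > bad)
    return wins / len(score_pairs)


# Hypothetical retriever scores for four queries, each with one
# higher-quality and one lower-quality candidate snippet.
pairs = [(0.82, 0.75), (0.60, 0.64), (0.91, 0.40), (0.55, 0.55)]
ppa = pairwise_preference_accuracy(pairs)
print(ppa)  # 0.5
```

Under this reading, a retriever that ignores quality and ranks near-duplicate snippets arbitrarily would score close to 0.5, which gives a natural baseline against which a quality-aware retriever can be compared.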