Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.
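To make the role of expansion tokens concrete, the following is a minimal sketch of sparse retrieval over an inverted index, not the SPLADE-Code implementation. The document identifiers, tokens, and weights are all invented for illustration; in a learned sparse retriever the weights would be predicted by the model. The sketch assumes the usual sparse scoring rule, a dot product over the tokens shared by query and document.

```python
from collections import defaultdict

# Toy sparse vectors (token -> weight). All weights are invented for
# illustration; a real model would predict them.
LEXICAL = {
    "doc_open": {"open": 1.2, "path": 0.8, "read": 0.9},
    "doc_sort": {"sorted": 1.3, "xs": 0.4},
}
# Hypothetical expansion terms: tokens that do not appear in the code itself.
EXPANSION = {
    "doc_open": {"file": 1.1, "load": 0.6},
    "doc_sort": {"sort": 1.0, "list": 0.7},
}


def merge(lexical, expansion):
    """Combine lexical and expansion weights into one sparse vector per document."""
    return {
        doc_id: {**lexical[doc_id], **expansion.get(doc_id, {})}
        for doc_id in lexical
    }


def build_inverted_index(doc_vectors):
    """Map each token to the list of (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return index


def search(query_vec, index, k=10):
    """Score documents by the dot product over tokens shared with the query."""
    scores = defaultdict(float)
    for token, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# A natural-language query whose tokens never occur literally in the code.
query = {"load": 1.0, "file": 1.0, "contents": 0.5}

lexical_only = search(query, build_inverted_index(LEXICAL))
with_expansion = search(query, build_inverted_index(merge(LEXICAL, EXPANSION)))

print("lexical only:  ", lexical_only)
print("with expansion:", with_expansion)
```

With lexical tokens alone, the query shares no token with either document and nothing is retrieved. Once the hypothetical expansion terms are added, `doc_open` is retrieved. Only the posting lists of the query's tokens are visited, which is why the number of active tokens per query and per document bears directly on latency.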