Scaling large language models (LLMs) has shown great potential for improving retrieval model performance; however, previous studies have mainly focused on dense retrieval trained with contrastive loss (CL), neglecting the scaling behavior of other retrieval paradigms and optimization techniques, such as sparse retrieval and knowledge distillation (KD). In this work, we conduct a systematic comparative study of how different retrieval paradigms (sparse vs. dense) and fine-tuning objectives (CL vs. KD vs. their combination) affect retrieval performance across model scales. Using MSMARCO passages as the training dataset, decoder-only LLMs (Llama-3 series: 1B, 3B, 8B), and a fixed compute budget, we evaluate various training configurations on both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks. Our key findings are: (1) Scaling behaviors emerge clearly only with CL, where larger models achieve significant performance gains, whereas KD-trained models show minimal improvement, performing similarly across the 1B, 3B, and 8B scales. (2) Sparse retrieval models consistently outperform dense retrieval on both in-domain and out-of-domain benchmarks, and they demonstrate greater robustness to imperfect supervision signals. (3) We successfully scale sparse retrieval models with the combination of CL and KD losses at the 8B scale, achieving state-of-the-art (SOTA) results on all evaluation sets.
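The two fine-tuning objectives compared above can be sketched in a minimal form: contrastive loss as InfoNCE over a positive passage and in-batch negatives, and KD as a KL divergence between the student's and a cross-encoder teacher's score distributions. The function names, the temperature value, and the weighted-sum combination below are illustrative assumptions, not the paper's actual implementation.

```python
import math

def info_nce_loss(scores, positive_idx, temperature=0.05):
    """Contrastive (InfoNCE) loss: negative log-softmax of the
    positive passage's score against all candidates in the batch."""
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[positive_idx] - log_z)

def kd_loss(student_scores, teacher_scores):
    """KD loss: KL(teacher || student) over softmax-normalized
    relevance scores for the same candidate passages."""
    def log_softmax(xs):
        m = max(xs)
        log_z = m + math.log(sum(math.exp(x - m) for x in xs))
        return [x - log_z for x in xs]
    s_log = log_softmax(student_scores)
    t_log = log_softmax(teacher_scores)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))

def combined_loss(student_scores, teacher_scores, positive_idx, alpha=0.5):
    """Weighted combination of CL and KD (alpha is a hypothetical
    mixing weight; the actual scheme may differ)."""
    return (alpha * info_nce_loss(student_scores, positive_idx)
            + (1 - alpha) * kd_loss(student_scores, teacher_scores))
```

In this sketch, CL only needs binary labels (which passage is the positive), while KD consumes the teacher's full score distribution; this is why KD quality is bounded by the teacher, consistent with the finding that KD-trained models plateau across scales.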