Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a ``specialization tax,'' exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations~(e.g., paraphrasing, typos) and malicious adversarial attacks~(e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.