Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or effective only in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this question, we conduct the first comprehensive analysis of LM-based expansion. We find a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models but generally harms stronger models. We show that this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they also introduce noise that makes it difficult to distinguish among the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset differs significantly in format from the training corpus; otherwise, avoid expansions to keep the relevance signal clear.
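To make the central measurement concrete, the following is a minimal sketch of how a correlation between baseline retriever performance and the gain from expansion could be computed. The function names `expansion_gains` and `pearson` are hypothetical helpers, and the scores are toy placeholder values chosen for illustration only; they are not results from this work, and the choice of Pearson correlation is an assumption rather than a confirmed detail of the analysis.

```python
from math import sqrt


def expansion_gains(baseline, expanded):
    """Per-model gain from expansion: expanded score minus baseline score."""
    if len(baseline) != len(expanded):
        raise ValueError("baseline and expanded must have the same length")
    return [e - b for b, e in zip(baseline, expanded)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy placeholder scores (NOT data from the paper): one entry per retriever,
# ordered from weakest to strongest baseline.
baseline = [20.0, 35.0, 50.0, 65.0]
expanded = [30.0, 40.0, 50.0, 60.0]

gains = expansion_gains(baseline, expanded)
r = pearson(baseline, gains)
print(f"gains per model: {gains}")
print(f"correlation between baseline score and gain: {r:.3f}")
```

In this toy setup the weakest model gains the most and the strongest model loses, so the correlation is strongly negative, which is the pattern the abstract describes. With real evaluation scores, the same two helpers would apply unchanged.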