When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets

Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they also add noise that makes it difficult to distinguish between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset differs significantly from the training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.
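The central claim, a negative correlation between a retriever's baseline score and the gain it gets from expansion, can be checked with a few lines of code. The sketch below is only an illustration of that computation, not the authors' evaluation pipeline: the function names are hypothetical, and the score lists are invented placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def expansion_gains(baseline, expanded):
    """Per-retriever gain: score with expansion minus score without it."""
    return [e - b for b, e in zip(baseline, expanded)]


# ASSUMPTION: placeholder scores for five hypothetical retrievers, ordered
# from weakest to strongest. These numbers are made up for illustration.
baseline = [0.30, 0.40, 0.50, 0.60, 0.70]
expanded = [0.38, 0.44, 0.51, 0.58, 0.65]

gains = expansion_gains(baseline, expanded)
r = pearson(baseline, gains)
print(f"correlation between baseline score and expansion gain: {r:.3f}")
```

With real data, `baseline` and `expanded` would hold one score per retriever (for example, a ranking metric averaged over the datasets), and a strongly negative `r` would match the trend the abstract describes.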

