LLM-based query expansion improves retrieval by enriching the original query with additional context. Yet most methods remain generation-driven: they produce plausible pseudo-documents or expansions without checking how the target corpus responds. This can introduce retrieval drift, amplify misleading vocabulary, or miss terms that distinguish relevant from non-relevant documents. We argue that effective expansion requires retrieval-grounded feedback, not just single-pass generation or unverified iteration. We introduce ADORE (ADapt, Observe, Relevance Evaluate), an iterative framework that turns retrieval outcomes into feedback for the next expansion. At each round, an LLM generates pseudo-passages, a retriever exposes the corpus response, and a relevance assessor evaluates the retrieved documents against the original query. These judgments identify what to reinforce, what remains undercovered, and what to suppress. Across TREC Deep Learning, BEIR, and BRIGHT, ADORE outperforms strong query expansion baselines in nearly all evaluation settings. On BEIR, it improves average nDCG@10 by 24.5% over BM25 and by 3.6% over the strongest prior query expansion method; on BRIGHT, it improves average nDCG@10 by 122.9% over BM25 and by 9.2% over the best query expansion baseline. Our code and data are publicly available.
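The round structure described in the abstract (generate, retrieve, assess, feed back) can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the released ADORE implementation: `toy_generate` stands in for the LLM, `retrieve` is a simple term-frequency scorer rather than BM25, `toy_judge` is a hard-coded stand-in for the relevance assessor, and every function name, the five-document corpus, and the set-difference feedback rules are hypothetical. There is no stopword handling.

```python
from collections import Counter
from typing import Callable, Dict, List, Set

Feedback = Dict[str, Set[str]]

def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer that keeps alphanumeric tokens only."""
    return [t for t in text.lower().split() if t.isalnum()]

def retrieve(terms: List[str], corpus: Dict[str, str], k: int) -> List[str]:
    """Toy lexical retriever (NOT BM25): score = sum of weight * term frequency."""
    weights = Counter(terms)
    scores = {}
    for doc_id, text in corpus.items():
        tf = Counter(tokenize(text))
        scores[doc_id] = sum(w * tf[t] for t, w in weights.items())
    ranked = sorted(corpus, key=lambda d: (-scores[d], d))  # ties broken by id
    return [d for d in ranked if scores[d] > 0][:k]

def build_feedback(query: str, retrieved: List[str], corpus: Dict[str, str],
                   judge: Callable[[str, str], bool]) -> Feedback:
    """Turn relevance judgments (made against the ORIGINAL query) into term sets."""
    q_terms = set(tokenize(query))
    rel: Set[str] = set()
    nonrel: Set[str] = set()
    for doc_id in retrieved:
        terms = set(tokenize(corpus[doc_id]))
        if judge(query, corpus[doc_id]):
            rel |= terms
        else:
            nonrel |= terms
    return {
        "reinforce": rel - nonrel,            # seen only in relevant documents
        "undercovered": q_terms - rel,        # query terms no relevant document covers
        "suppress": nonrel - rel - q_terms,   # seen only in non-relevant documents
    }

def expand_iteratively(query: str, corpus: Dict[str, str],
                       generate: Callable[[str, Feedback], str],
                       judge: Callable[[str, str], bool],
                       rounds: int = 2, k: int = 2) -> List[List[str]]:
    """Run generate -> retrieve -> assess -> feedback; return each round's ranking."""
    feedback: Feedback = {"reinforce": set(), "undercovered": set(), "suppress": set()}
    history: List[List[str]] = []
    for _ in range(rounds):
        pseudo = generate(query, feedback)
        kept = [t for t in tokenize(pseudo) if t not in feedback["suppress"]]
        retrieved = retrieve(tokenize(query) + kept, corpus, k)
        feedback = build_feedback(query, retrieved, corpus, judge)
        history.append(retrieved)
    return history

# Hypothetical toy data and stand-ins for the LLM components.
CORPUS = {
    "animal_1": "the jaguar is a big cat and its running speed helps it hunt prey",
    "animal_2": "a jaguar is a cat that can hunt prey in water",
    "car_1": "the jaguar car has a powerful engine and high top speed",
    "leopard": "a big cat such as the leopard relies on stealth to hunt prey",
    "tuning": "engine tuning improves car speed",
}

def toy_generate(query: str, feedback: Feedback) -> str:
    """Stand-in for an LLM: a fixed guess that drifts toward the car sense,
    extended with whatever the previous round asked to reinforce or cover."""
    hints = sorted(feedback["reinforce"] | feedback["undercovered"])
    return " ".join(["jaguar speed engine"] + hints)

def toy_judge(query: str, doc: str) -> bool:
    """Stand-in assessor: treats a document as relevant if it is about the animal."""
    terms = set(tokenize(doc))
    return "jaguar" in terms and "cat" in terms

history = expand_iteratively("jaguar running speed", CORPUS, toy_generate, toy_judge)
print(history)
```

In this toy run, the first round's unverified guess pulls the car document into the top two. After the judgments are fed back, "engine" is filtered out and terms from the relevant passage are reinforced, so the second round ranks both animal documents first. A real system would replace the three stand-ins with an LLM, a proper retriever, and a model-based assessor.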