Recent studies demonstrate that large language models (LLMs) can considerably enhance information retrieval systems by generating hypothetical documents that answer the query and serve as expansions. However, misalignments between these expansions and the retrieval corpus, stemming from the limited intrinsic knowledge of LLMs, lead to issues such as hallucinations and outdated information. Inspired by Pseudo Relevance Feedback (PRF), we introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus. CSQE utilizes the relevance-assessing capability of LLMs to systematically identify pivotal sentences in the initially retrieved documents. These corpus-originated texts are then combined with LLM-knowledge-empowered expansions to expand the query, improving the relevance prediction between the query and the target documents. Extensive experiments reveal that CSQE exhibits strong performance without any training, especially on queries for which LLMs lack knowledge.
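The pipeline described above (initial retrieval, LLM-based selection of pivotal sentences, and combination with LLM-generated expansions) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the retriever and the "LLM" relevance assessor are stand-in word-overlap heuristics, and `hypothetical_document` is a fixed template standing in for a HyDE-style LLM expansion.

```python
def word_overlap(a: str, b: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Initial retrieval: top-k documents by the toy overlap score."""
    return sorted(corpus, key=lambda d: word_overlap(query, d), reverse=True)[:k]

def extract_pivotal_sentences(query: str, docs: list[str], n: int = 2) -> list[str]:
    """Stand-in for the LLM relevance assessor: keep the sentences from
    the retrieved documents that overlap the query most."""
    sentences = [s.strip() for d in docs for s in d.split(".") if s.strip()]
    return sorted(sentences, key=lambda s: word_overlap(query, s), reverse=True)[:n]

def hypothetical_document(query: str) -> str:
    """Stand-in for the LLM-knowledge expansion: a hypothetical answer
    document (here just a template, in practice an LLM generation)."""
    return f"A document answering: {query}"

def csqe_expand(query: str, corpus: list[str], repeats: int = 2) -> str:
    """Expand the query with corpus-originated sentences plus the
    LLM-knowledge expansion; the original query is repeated so it
    keeps weight in the expanded bag of words, a common PRF trick."""
    docs = retrieve(query, corpus)
    pivotal = extract_pivotal_sentences(query, docs)
    return " ".join([query] * repeats + pivotal + [hypothetical_document(query)])

corpus = [
    "The capital of France is Paris. Paris lies on the Seine.",
    "Bananas are rich in potassium.",
    "France borders Spain and Italy.",
]
expanded = csqe_expand("capital of France", corpus)
print(expanded)
```

With the toy corpus above, the expanded query pulls in the corpus-grounded sentence "The capital of France is Paris" alongside the hypothetical-document text, so a downstream lexical retriever would now match documents mentioning "Paris" even though the raw query does not contain that term.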