Query expansion is a commonly used technique in search systems that represents users' information needs more fully by adding query terms. Existing studies usually expand a query with retrieved or generated contextual documents. However, both types of methods have clear limitations. For retrieval-based methods, the documents retrieved with the original query might not be accurate enough to reveal the search intent, especially when the query is brief or ambiguous. For generation-based methods, existing models are hard to train or align on a particular corpus because corpus-specific labeled data is lacking. In this paper, we propose a novel Large Language Model (LLM) based mutual verification framework for query expansion, which alleviates these limitations. Specifically, we first design a query-query-document generation pipeline, which leverages the contextual knowledge encoded in LLMs to generate sub-queries and corresponding documents from multiple perspectives. Next, we apply a mutual verification method to both generated and retrieved contextual documents, where 1) retrieved documents are filtered with the external contextual knowledge in generated documents, and 2) generated documents are filtered with the corpus-specific knowledge in retrieved documents. Overall, the proposed method allows retrieved and generated documents to complement each other and produce a better query expansion. We conduct extensive experiments on three information retrieval datasets, i.e., TREC-DL-2020, TREC-COVID, and MSMARCO. The results demonstrate that our method significantly outperforms the baselines.
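The two-way filtering described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how documents are scored against each other, so the bag-of-words cosine similarity, the top-k cutoff, and every function and parameter name below (`filter_by_reference`, `mutual_verification`, `expand_query`, `keep_retrieved`, `keep_generated`) are assumptions made for illustration. The LLM generation step and the retriever are also left out; both document sets are passed in as plain strings.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased term-frequency vector; an assumed stand-in for a real encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_by_reference(candidates, references, keep):
    """Keep the `keep` candidates with the highest mean similarity to `references`."""
    ref_vecs = [bag_of_words(r) for r in references]
    if not ref_vecs:
        # Nothing to verify against, so fall back to the original order.
        return list(candidates[:keep])

    def score(doc):
        vec = bag_of_words(doc)
        return sum(cosine(vec, r) for r in ref_vecs) / len(ref_vecs)

    return sorted(candidates, key=score, reverse=True)[:keep]


def mutual_verification(retrieved, generated, keep_retrieved=2, keep_generated=2):
    """Filter each document set using the other one as the reference."""
    # 1) Retrieved documents are checked against the generated documents.
    kept_retrieved = filter_by_reference(retrieved, generated, keep_retrieved)
    # 2) Generated documents are checked against the retrieved documents.
    kept_generated = filter_by_reference(generated, retrieved, keep_generated)
    return kept_retrieved, kept_generated


def expand_query(query, retrieved, generated, **kwargs):
    """Append the surviving documents to the original query (assumed expansion format)."""
    kept_retrieved, kept_generated = mutual_verification(retrieved, generated, **kwargs)
    return " ".join([query, *kept_generated, *kept_retrieved])


if __name__ == "__main__":
    retrieved_docs = [
        "python snake habitat and diet",
        "python programming language tutorial",
    ]
    generated_docs = [
        "learn python programming with a tutorial",
        "bananas are rich in potassium",
    ]
    print(expand_query("python tutorial", retrieved_docs, generated_docs,
                       keep_retrieved=1, keep_generated=1))
```

In this toy setup, an off-topic generated document shares no terms with the retrieved set and is dropped, while a retrieved document matching the wrong sense of an ambiguous query scores low against the generated set. A real system would likely replace the lexical similarity with a stronger relevance model.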