Query expansion is a commonly used technique in search systems that adds query terms to better represent users' information needs. Existing studies usually expand a query with either retrieved or generated contextual documents, but both approaches have clear limitations. For retrieval-based methods, the documents retrieved with the original query may not be accurate enough to reveal the search intent, especially when the query is brief or ambiguous. For generation-based methods, existing models can hardly be trained or aligned on a particular corpus because corpus-specific labeled data is lacking. In this paper, we propose a novel Large Language Model (LLM) based mutual verification framework for query expansion that alleviates these limitations. Specifically, we first design a query-query-document generation pipeline, which leverages the contextual knowledge encoded in LLMs to generate sub-queries and corresponding documents from multiple perspectives. Next, we apply a mutual verification method to both generated and retrieved contextual documents, where 1) retrieved documents are filtered with the external contextual knowledge in generated documents, and 2) generated documents are filtered with the corpus-specific knowledge in retrieved documents. Overall, the proposed method allows retrieved and generated documents to complement each other and produce a better query expansion. We conduct extensive experiments on three information retrieval datasets: TREC-DL-2020, TREC-COVID, and MSMARCO. The results demonstrate that our method significantly outperforms the baselines.
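The two stages described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify prompts, scoring functions, or thresholds, so everything here is an assumption made for illustration. In particular, `llm` is a hypothetical text-in/text-out callable, the prompt strings are invented, Jaccard token overlap stands in for whatever relevance measure the method actually uses, and keeping the top `k` documents on each side is an assumed filtering rule.

```python
import re


def generate_documents(query, llm, n_subqueries=2):
    """Query -> sub-queries -> documents.

    `llm` is a hypothetical callable (prompt string in, text out);
    the prompts below are illustrative assumptions.
    """
    subqueries = [
        llm(f"Write sub-query {i + 1} covering another aspect of: {query}")
        for i in range(n_subqueries)
    ]
    return [llm(f"Write a short passage answering: {sq}") for sq in subqueries]


def tokens(text):
    """Lowercased word set; a crude stand-in for a real text encoder."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Jaccard similarity between the token sets of two texts (assumed measure)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def top_k_supported(candidates, references, k):
    """Keep the k candidates with the highest mean similarity to the references."""
    if not references:
        return candidates[:k]  # nothing to verify against

    def support(doc):
        return sum(overlap(doc, ref) for ref in references) / len(references)

    return sorted(candidates, key=support, reverse=True)[:k]


def mutual_verify(retrieved, generated, k=1):
    """Filter each document set using the other one as evidence."""
    kept_retrieved = top_k_supported(retrieved, generated, k)
    kept_generated = top_k_supported(generated, retrieved, k)
    return kept_retrieved, kept_generated


def expand_query(query, retrieved, generated, k=1):
    """Append the mutually verified documents to the original query."""
    kept_retrieved, kept_generated = mutual_verify(retrieved, generated, k)
    return " ".join([query, *kept_retrieved, *kept_generated])


if __name__ == "__main__":
    retrieved = [
        "jaguar speed in the wild cat family",
        "jaguar car dealership prices",
    ]
    generated = [
        "the jaguar is a big cat that runs fast in the wild",
        "stock market news today",
    ]
    print(expand_query("how fast is a jaguar", retrieved, generated, k=1))
```

In this toy run the ambiguous query "jaguar" retrieves one document about the animal and one about the car, while the generated set contains one on-topic and one off-topic passage. Each side is scored against the other, so the car document and the off-topic passage receive the least support and are dropped before expansion. A real system would likely replace `overlap` with a learned relevance model.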