Query expansion has long been employed to improve the accuracy of retrieval systems. Earlier works relied on pseudo-relevance feedback (PRF) techniques, which augment a query with terms extracted from documents retrieved in a first stage. However, the retrieved documents may be noisy, hindering ranking effectiveness. To avoid this, recent studies have instead used Large Language Models (LLMs) to generate additional content with which to expand a query. These techniques, however, are prone to hallucination, and prior work on them has focused mainly on the LLM usage cost. Yet in several important practical scenarios, the cost may instead be dominated by retrieval, namely when the corpus is only available via APIs that charge a fee per retrieved document. We propose combining classic PRF techniques with LLMs, creating a progressive query expansion algorithm, ProQE, that iteratively expands the query as it retrieves more documents. ProQE is compatible with both sparse and dense retrieval systems. Our experimental results on four retrieval datasets show that ProQE outperforms state-of-the-art baselines by 37% and is the most cost-effective.
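The progressive expansion idea can be illustrated with a minimal sketch: retrieve a small batch of documents, mine new terms from them, expand the query, and repeat. This is not the paper's actual algorithm; the toy corpus, the term-overlap `retrieve` function, and the frequency-based `select_terms` heuristic (a stand-in for the LLM-guided term selection) are all illustrative assumptions.

```python
from collections import Counter

# Toy in-memory corpus standing in for an API-backed document store; in the
# paper's setting each retrieve() call would incur a per-document fee.
CORPUS = {
    1: "neural retrieval models rank documents with dense embeddings",
    2: "query expansion adds terms to improve retrieval recall",
    3: "pseudo relevance feedback extracts terms from top documents",
    4: "large language models can generate expansion terms for a query",
}

def retrieve(query_terms, k=2):
    """Score documents by term overlap with the (expanded) query."""
    scored = sorted(
        CORPUS.items(),
        key=lambda kv: -len(set(kv[1].split()) & set(query_terms)),
    )
    return [doc_id for doc_id, _ in scored[:k]]

def select_terms(doc_ids, query_terms, n=2):
    """Stand-in for PRF/LLM term selection: pick the most frequent
    unseen terms from the retrieved documents (hypothetical heuristic)."""
    counts = Counter(
        t for d in doc_ids for t in CORPUS[d].split() if t not in query_terms
    )
    return [t for t, _ in counts.most_common(n)]

def progressive_expand(query, rounds=3, k=2):
    """Iteratively retrieve, mine new terms, and expand the query."""
    terms = query.split()
    for _ in range(rounds):
        top = retrieve(terms, k)
        new = select_terms(top, terms)
        if not new:
            break  # no new evidence to add; stop expanding
        terms += new
    return terms, retrieve(terms, k)
```

Because expansion happens progressively, each round retrieves only a small batch, which is what keeps the per-document retrieval fees bounded in the fee-per-document scenario the abstract describes.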