Doc2Query -- the process of expanding the content of a document before indexing using a sequence-to-sequence model -- has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to "hallucinating" content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 23% and cutting the index size by 33%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration at https://github.com/terrierteam/pyterrier_doc2query.
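The filtering step described above can be illustrated with a minimal sketch: score each generated query against its source document, then keep only the queries that clear a threshold before appending them to the text that gets indexed. Everything named here is an assumption made for illustration and is not taken from the released code: `overlap_score` is a toy term-overlap stand-in for a real relevance model, and `filter_queries`, `expand_document`, and the threshold of 0.5 are hypothetical.

```python
from typing import Callable, List


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a relevance model (assumption, not the paper's scorer).

    Returns the fraction of query terms that also occur in the document.
    """
    q_terms = query.lower().split()
    d_terms = set(doc.lower().split())
    if not q_terms:
        return 0.0
    return sum(t in d_terms for t in q_terms) / len(q_terms)


def filter_queries(
    doc: str,
    queries: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Keep only the generated queries whose score meets the threshold."""
    return [q for q in queries if score_fn(q, doc) >= threshold]


def expand_document(
    doc: str,
    queries: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,  # hypothetical cut-off, chosen for the example
) -> str:
    """Append the surviving queries to the document text before indexing."""
    kept = filter_queries(doc, queries, score_fn, threshold)
    return " ".join([doc] + kept)


# Hypothetical document and generated queries, for illustration only.
doc = "the terrier is a small dog breed originally bred for hunting"
generated = [
    "what is a terrier dog",        # mostly supported by the document
    "terrier dog price in paris",   # introduces unsupported content
    "small dog breed for hunting",  # fully supported by the document
]

kept = filter_queries(doc, generated, overlap_score, threshold=0.5)
expanded = expand_document(doc, generated)
```

In this toy run the second query introduces terms the document does not contain and is dropped, so the expanded text stays shorter than it would if all three queries were appended. A practical setup would replace `overlap_score` with a trained relevance model and choose the threshold from held-out data.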