Recent advances in large language models have enabled the development of viable generative retrieval systems. Instead of a traditional document ranking, many generative retrieval systems directly return a grounded generated text as an answer to an information need expressed as a query or question. Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval. Yet, the established evaluation methodology for ranking-based retrieval is not suited for reliable, repeatable, and reproducible evaluation of generated answers. In this paper, we survey the relevant literature from the fields of information retrieval and natural language processing, we identify search tasks and system architectures in generative retrieval, we develop a corresponding user model, and we study its operationalization. Our analysis provides a foundation and new insights for the evaluation of generative retrieval systems, focusing on ad hoc retrieval.