Recent advances in large language models have enabled the development of viable generative retrieval systems. Instead of a traditional ranked list of documents, generative retrieval systems often directly return grounded generated text as a response to a query. Quantifying the utility of these textual responses is essential for appropriately evaluating such generative ad hoc retrieval. Yet, the established evaluation methodology for ranking-based ad hoc retrieval is not suited for the reliable and reproducible evaluation of generated responses. To lay a foundation for developing new evaluation methods for generative retrieval systems, we survey the relevant literature from the fields of information retrieval and natural language processing, identify search tasks and system architectures in generative retrieval, develop a new user model, and study its operationalization.