An In-depth Investigation of User Response Simulation for Conversational Search

Conversational search has recently attracted increasing attention in both the IR and NLP communities. It seeks to clarify and solve users' search needs through multi-turn natural language interactions. However, most existing systems are trained and demonstrated with recorded or artificial conversation logs. Eventually, conversational search systems should be trained, evaluated, and deployed in an open-ended setting with unseen conversation trajectories. A key challenge is that training and evaluating such systems both require a human in the loop, which is expensive and does not scale. One strategy is to simulate users, thereby reducing the scaling costs. However, current user simulators are either limited to answering yes-no questions from the conversational search system or unable to produce high-quality responses in general. In this paper, we show that existing user simulation systems can be significantly improved by a smaller finetuned natural language generation model. However, rather than merely reporting it as the new state of the art, we consider it a strong baseline and present an in-depth investigation of simulating user responses for conversational search. Our goal is to supplement existing work with an insightful hand analysis of the challenges that the baseline leaves unsolved and to propose solutions to them. The challenges we identify include (1) a blind spot that is difficult to learn, and (2) a specific type of misevaluation in the standard setup. We propose a new generation system that effectively covers the training blind spot and suggest a new evaluation setup that avoids misevaluation. Our proposed system leads to significant improvements over existing systems and large language models such as GPT-4. Additionally, our analysis provides insights into the nature of user simulation to facilitate future work.
