Conversational search has received increasing attention in both the IR and NLP communities. It seeks to clarify and satisfy users' search needs through multi-turn natural language interactions. However, most existing systems are trained and demonstrated on recorded or artificial conversation logs. Eventually, conversational search systems should be trained, evaluated, and deployed in an open-ended setting with unseen conversation trajectories. A key challenge is that both training and evaluating such systems require a human in the loop, which is expensive and does not scale. One strategy is to simulate users, thereby reducing the cost of scaling. However, current user simulators are either limited to responding only to yes-no questions from the conversational search system or unable to produce high-quality responses in general. In this paper, we show that existing user simulation systems can be significantly outperformed by a smaller, fine-tuned natural language generation model. Rather than merely reporting it as the new state of the art, however, we treat it as a strong baseline and present an in-depth investigation of simulating user responses for conversational search. Our goal is to supplement existing work with an insightful hand analysis of the challenges the baseline leaves unsolved, and to propose solutions to them. The challenges we identify include (1) a training blind spot that is difficult to learn, and (2) a specific type of misevaluation in the standard setup. We propose a new generation system that effectively covers the training blind spot and suggest a new evaluation setup that avoids the misevaluation. Our proposed system leads to significant improvements over existing systems and large language models such as GPT-4. Additionally, our analysis provides insights into the nature of user simulation to facilitate future work.