An In-depth Investigation of User Response Simulation for Conversational Search

Conversational search has recently attracted increased attention in both the IR and NLP communities. It seeks to clarify and solve a user's search need through multi-turn natural language interactions. However, most existing systems are trained and demonstrated with recorded or artificial conversation logs. Eventually, conversational search systems should be trained, evaluated, and deployed in an open-ended setting with unseen conversation trajectories. A key challenge is that training and evaluating such systems both require a human in the loop, which is expensive and does not scale. One strategy is to simulate users, thereby reducing the scaling costs. However, current user simulators are either limited to answering yes-no questions from the conversational search system or unable to produce high-quality responses in general. In this paper, we show that the current state-of-the-art user simulation system can be significantly improved by replacing it with a smaller but more advanced natural language generation model. Rather than merely reporting this new state of the art, we present an in-depth investigation of the task of simulating user responses for conversational search. Our goal is to supplement existing work with a manual analysis of the challenges that the advanced model still leaves unsolved, and to propose solutions for them. The challenges we identify include (1) dataset noise, (2) a blind spot that is difficult for existing models to learn, and (3) a specific type of misevaluation in the standard empirical setup. With the exception of the dataset noise issue, we propose solutions for these challenges: one to cover the training blind spot and one to avoid the misevaluation. These solutions lead to further improvements, and our best system significantly outperforms the previous state of the art.
