With the increasing popularity of conversational search, how to evaluate the performance of conversational search systems has become an important question in the IR community. Existing work on conversational search evaluation falls mainly into two streams: (1) constructing metrics based on semantic similarity (e.g., BLEU, METEOR, and BERTScore), or (2) directly evaluating the system's response ranking performance with traditional search metrics (e.g., nDCG, RBP, and nERR). However, these methods ignore either the user's information need or the mixed-initiative nature of conversational search. This raises the question of how to accurately model user satisfaction in conversational search scenarios. Because explicitly asking users for satisfaction feedback is difficult, traditional IR studies often rely on the Cranfield paradigm (i.e., third-party annotation) and on user behavior modeling to estimate user satisfaction in search. The feasibility and effectiveness of these two approaches, however, have not been fully explored in conversational search. In this paper, we examine the evaluation of conversational search from the perspective of user satisfaction. We build a novel conversational search experimental platform and construct a Chinese open-domain conversational search behavior dataset containing rich annotations and search behavior data. We also collect third-party satisfaction annotations at the session level and the turn level to investigate the feasibility of the Cranfield paradigm in the conversational search scenario. Experimental results show both some consistency and considerable differences between the users' own satisfaction annotations and the third-party annotations. We also propose dialog continuation or ending behavior models (DCEBM) to capture session-level user satisfaction based on turn-level information.