A distinctive feature of conversational search systems is that they support mixed-initiative interactions, such as system-generated clarifying questions about the query. Evaluating these systems at scale on the end task of information retrieval (IR) is very challenging, because it requires datasets that contain such interactions. However, current datasets focus either on traditional ad-hoc IR tasks or on query clarification tasks, the latter usually being treated as a reformulation of the initial query. The only two datasets known to us that contain both document relevance judgments and the associated clarification interactions are Qulac and ClariQ. Both are based on the TREC Web Track 2009-2012 collection, but they cover only 237 topics, far too few for training and testing conversational IR models. To fill this gap, we propose a methodology for automatically building large-scale conversational IR datasets from ad-hoc IR datasets, in order to facilitate research on conversational IR. Our methodology is based on two processes: 1) generating query clarification interactions through query clarification and answer generators, and 2) augmenting ad-hoc IR datasets with the simulated interactions. In this paper, we focus on MsMarco and augment it with query clarification and answer simulations. We perform a thorough evaluation of the quality and relevance of the interactions generated for each initial query. This paper shows the feasibility and utility of augmenting ad-hoc IR datasets for conversational IR.
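The two processes named in the abstract can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the `Interaction` record, the template-based `toy_question_generator`, the substring-matching `toy_answer_generator`, the `augment` function, and the idea of using a judged-relevant passage as a proxy for the user's intent are all hypothetical stand-ins for the generators the paper refers to.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Interaction:
    """One simulated clarification turn attached to an ad-hoc query."""
    query_id: str
    query: str
    clarifying_question: str
    answer: str


def toy_question_generator(query: str, facet: str) -> str:
    # Hypothetical stand-in for a clarifying-question generator:
    # a fixed template that asks about one candidate facet.
    return f"Are you interested in {facet}?"


def toy_answer_generator(facet: str, intent_text: str) -> str:
    # Hypothetical stand-in for a user-answer simulator: answers "yes"
    # when the facet string occurs in the text standing in for the intent.
    return "yes" if facet.lower() in intent_text.lower() else "no"


def augment(
    queries: Dict[str, str],
    relevant_passages: Dict[str, str],
    candidate_facets: Dict[str, List[str]],
    ask: Callable[[str, str], str] = toy_question_generator,
    reply: Callable[[str, str], str] = toy_answer_generator,
) -> List[Interaction]:
    """Attach simulated clarification interactions to ad-hoc IR queries.

    Assumption: the judged-relevant passage of each query is used as a
    proxy for the user's underlying intent.
    """
    interactions = []
    for query_id, query in queries.items():
        intent_text = relevant_passages[query_id]
        for facet in candidate_facets.get(query_id, []):
            question = ask(query, facet)        # process 1: generate the question
            answer = reply(facet, intent_text)  # process 1: simulate the answer
            # process 2: link the interaction to the original query
            interactions.append(Interaction(query_id, query, question, answer))
    return interactions
```

Because the generators are passed in as plain callables, the toy templates could be swapped for learned models without changing `augment`; the existing relevance judgments of the ad-hoc dataset stay untouched and are simply joined to the new interactions by `query_id`.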