In mixed-initiative conversational search systems, clarifying questions help users who struggle to express their intent in a single query. These questions aim to uncover users' information needs and resolve query ambiguities. We hypothesize that in scenarios where multimodal information is pertinent, the clarification process can be improved by using non-textual information. We therefore propose adding images to clarifying questions and formulate the novel task of asking multimodal clarifying questions in open-domain, mixed-initiative conversational search systems. To facilitate research on this task, we collect a dataset named Melon that contains over 4k multimodal clarifying questions, enriched with over 14k images. We also propose a multimodal query clarification model named Marto and adopt a prompt-based, generative fine-tuning strategy in which different stages are trained with different prompts. We conduct several analyses to understand the importance of multimodal content during the query clarification phase. Experimental results indicate that adding images leads to significant improvements of up to 90% in retrieval performance when the relevant images are selected. Extensive analyses further show that Marto outperforms discriminative baselines in both effectiveness and efficiency.