Chat has emerged as an effective, user-friendly approach to information retrieval and is successfully employed in many domains, such as customer service, healthcare, and finance. However, existing image retrieval approaches typically address the case of a single query-to-image round, and the use of chat for image retrieval has been mostly overlooked. In this work, we introduce ChatIR: a chat-based image retrieval system that engages in a conversation with the user to elicit information beyond the initial query, in order to clarify the user's search intent. Motivated by the capabilities of today's foundation models, we leverage Large Language Models to generate follow-up questions to an initial image description. These questions form a dialog with the user in order to retrieve the desired image from a large corpus. In this study, we explore the capabilities of such a system on a large dataset and show that engaging in a dialog yields significant gains in image retrieval. We start by building an evaluation pipeline from an existing manually generated dataset and explore different modules and training strategies for ChatIR. Our comparison includes strong baselines derived from related applications trained with Reinforcement Learning. Our system retrieves the target image from a pool of 50K images with a success rate of over 78% after 5 dialog rounds, compared to 75% when questions are asked by humans and 64% for single-shot text-to-image retrieval. Extensive evaluations reveal the strong capabilities and examine the limitations of ChatIR under different settings.
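The dialog loop described in the abstract can be illustrated with a minimal sketch. This is not the ChatIR implementation: the toy corpus, the bag-of-words `embed` function, the canned question/answer pairs, and every function name below are assumptions made for illustration. In the system described above, the questions would come from a Large Language Model and the ranking from a learned retrieval model over 50K images.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: captions stand in for image representations.
CORPUS = {
    "img_a": "a brown dog running on a sandy beach",
    "img_b": "a brown dog sleeping on a red sofa",
    "img_c": "a black cat sitting on a red sofa",
}

def embed(text):
    """Bag-of-words vector (assumption: a real system uses a learned encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_images(dialog_text, corpus):
    """Return image ids ordered from most to least similar to the dialog so far."""
    query = embed(dialog_text)
    return sorted(corpus, key=lambda i: cosine(query, embed(corpus[i])), reverse=True)

def chat_retrieve(initial_query, qa_rounds, corpus, target):
    """Re-rank after every dialog round; record the target's rank (0 = top)."""
    dialog = initial_query
    ranks = [rank_images(dialog, corpus).index(target)]
    for question, answer in qa_rounds:
        # Canned pairs stand in for LLM-generated questions and user answers.
        dialog += " " + question + " " + answer
        ranks.append(rank_images(dialog, corpus).index(target))
    return ranks

# Usage: an ambiguous initial description followed by two clarifying rounds.
rounds = [
    ("where is the dog?", "on a red sofa"),
    ("is the dog awake?", "no it is sleeping"),
]
ranks = chat_retrieve("a brown dog", rounds, CORPUS, target="img_b")
print(ranks)  # rank of the target after 0, 1, and 2 dialog rounds
```

The initial description matches two captions equally well, so the target is not guaranteed to rank first; each answer adds words that only the target caption contains, which is the intuition behind the reported gain from dialog over single-shot retrieval.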