Chatting Makes Perfect: Chat-based Image Retrieval

Chats have emerged as an effective, user-friendly approach to information retrieval and are successfully employed in many domains, such as customer service, healthcare, and finance. However, existing image retrieval approaches typically address the case of a single query-to-image round, and the use of chats for image retrieval has been mostly overlooked. In this work, we introduce ChatIR: a chat-based image retrieval system that engages in a conversation with the user to elicit information beyond the initial query, in order to clarify the user's search intent. Motivated by the capabilities of today's foundation models, we leverage Large Language Models to generate follow-up questions to an initial image description. These questions form a dialog with the user in order to retrieve the desired image from a large corpus. We evaluate such a system on a large dataset and show that engaging in a dialog yields significant gains in image retrieval. We start by building an evaluation pipeline from an existing manually generated dataset and explore different modules and training strategies for ChatIR. Our comparison includes strong baselines derived from related applications trained with Reinforcement Learning. Our system retrieves the target image from a pool of 50K images with a success rate of over 78% after 5 dialog rounds, compared to 75% when the questions are asked by humans and 64% for single-shot text-to-image retrieval. Extensive evaluations reveal the strong capabilities and examine the limitations of ChatIR under different settings. The project repository is available at https://github.com/levymsn/ChatIR.
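
To make the dialog loop described above concrete, the following is a minimal sketch of chat-based retrieval and not the ChatIR implementation. Everything in it is an assumption made for illustration: a bag-of-words cosine similarity over captions stands in for a learned image and text encoder, a fixed list of questions stands in for the LLM questioner, a lookup table stands in for the user, and the names `embed`, `rank`, `chat_retrieve`, `ask`, and `answer` are hypothetical. It only shows how the rank of a target image can be tracked as each question-answer round is appended to the query.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumed stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(dialog, corpus):
    """Rank image ids by similarity between the full dialog and each caption."""
    query = embed(" ".join(dialog))
    return sorted(corpus, key=lambda i: cosine(query, embed(corpus[i])), reverse=True)


def chat_retrieve(initial_query, ask, answer, corpus, rounds):
    """Return one ranking per round; round 0 uses only the initial query."""
    dialog = [initial_query]
    rankings = [rank(dialog, corpus)]
    for _ in range(rounds):
        question = ask(dialog)               # questioner (an LLM in the paper)
        dialog += [question, answer(question)]  # user reply joins the query
        rankings.append(rank(dialog, corpus))
    return rankings


# Hypothetical three-image corpus, described by captions instead of pixels.
corpus = {
    "img_a": "a brown dog",
    "img_b": "a black dog on a sofa",
    "img_c": "a black cat on a sofa",
}
questions = ["What color is the dog?", "Where is the dog?"]
replies = {"What color is the dog?": "black", "Where is the dog?": "on a sofa"}


def ask(dialog):
    # Each completed round adds two entries, so this indexes the next question.
    return questions[(len(dialog) - 1) // 2]


def answer(question):
    return replies[question]


rankings = chat_retrieve("a dog", ask, answer, corpus, rounds=2)
target_ranks = [r.index("img_b") for r in rankings]
print(target_ranks)  # → [1, 0, 0]
```

In this toy run the target `img_b` starts in second place and moves to first after one round, which mirrors, at a very small scale, the kind of per-round gain the evaluation measures.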
