Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Image search is a central task in multimedia and computer vision, with applications ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidates from a database. However, prevalent methods often rely on single-turn procedures, which can introduce inaccuracies and limit recall. They also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system that refines queries based on user relevance feedback in a multi-turn setting. The system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, so that the query becomes more informative with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in the image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, which provides multiple relevant ground-truth images for each query. Through comprehensive experiments, we validate the effectiveness of the proposed system against baseline methods, achieving state-of-the-art performance with a 10% improvement in recall. Our contributions are an interactive image retrieval system, the integration of an LLM-based denoiser, a carefully designed evaluation dataset, and thorough experimental validation.
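The multi-turn procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the word-overlap retriever, the `caption` and `denoise` stubs, the stopword list, and the toy index are all hypothetical stand-ins for a real text-image retriever, a VLM captioner, and an LLM denoiser. It only shows the control flow: retrieve, collect relevance feedback, caption the accepted images, denoise the captions, and expand the query.

```python
from typing import Callable, Dict, List, Set

def tokens(text: str) -> Set[str]:
    return set(text.lower().split())

def retrieve(query: str, index: Dict[str, str], k: int, exclude: Set[str]) -> List[str]:
    """Toy retriever (assumption): rank unseen images by word overlap with the query."""
    q = tokens(query)
    candidates = [img for img in index if img not in exclude]
    # Highest overlap first; ties are broken by image id for determinism.
    candidates.sort(key=lambda img: (-len(q & tokens(index[img])), img))
    return candidates[:k]

def interactive_search(
    query: str,
    index: Dict[str, str],
    is_relevant: Callable[[str], bool],
    caption: Callable[[str], str],
    denoise: Callable[[str], str],
    turns: int = 2,
    k: int = 2,
) -> List[str]:
    """Run several retrieval turns, expanding the query after each one."""
    found: List[str] = []
    seen: Set[str] = set()
    for _ in range(turns):
        results = retrieve(query, index, k, exclude=seen)
        seen.update(results)
        relevant = [img for img in results if is_relevant(img)]  # user feedback
        found.extend(relevant)
        # Expand the query with denoised captions of the accepted images.
        for img in relevant:
            query = f"{query} {denoise(caption(img))}"
    return found

def recall(found: List[str], ground_truth: Set[str]) -> float:
    return len(set(found) & ground_truth) / len(ground_truth)

# Hypothetical toy data: image id -> text standing in for image content.
INDEX = {
    "img1": "man playing guitar on stage",
    "img2": "man singing on stage with band",
    "img3": "dog running in park",
    "img4": "band performing concert crowd",
    "img5": "cat sleeping on sofa",
}
GROUND_TRUTH = {"img1", "img2", "img4"}
STOPWORDS = {"a", "the", "on", "in", "with"}

def caption(img: str) -> str:
    # Stand-in for a VLM captioner; here it simply returns the indexed text.
    return INDEX[img]

def denoise(text: str) -> str:
    # Stand-in for an LLM denoiser; here it only drops filler words.
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def simulated_user(img: str) -> bool:
    # Simulated relevance feedback drawn from the ground truth.
    return img in GROUND_TRUTH

single_turn = interactive_search("man guitar", INDEX, simulated_user, caption, denoise, turns=1)
multi_turn = interactive_search("man guitar", INDEX, simulated_user, caption, denoise, turns=2)
print(recall(single_turn, GROUND_TRUTH), recall(multi_turn, GROUND_TRUTH))
```

In this toy setup the initial query shares no words with `img4`, so a single turn cannot reach it. The expanded query picks up "band" from an accepted image's caption, which brings `img4` into the second turn's results. A real system would replace each stub with a model call and would need a denoiser that corrects caption errors rather than only removing filler words.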
