InfoCIR: Multimedia Analysis for Composed Image Retrieval

From arXiv: 9+2 pages, 8 figures. Accepted for publication in IEEE PacificVis 2026 (Conference Track). Topic: interactive composed image retrieval (CIR) and ranking explanation.

Composed Image Retrieval (CIR) allows users to search for images by combining a reference image with a text prompt that describes desired modifications. While vision-language models like CLIP have popularized this task by embedding multiple modalities into a joint space, developers still lack tools that reveal how these multimodal prompts interact with embedding spaces and why small wording changes can dramatically alter the results. We present InfoCIR, a visual analytics system that closes this gap by coupling retrieval, explainability, and prompt engineering in a single, interactive dashboard. InfoCIR integrates a state-of-the-art CIR back-end (SEARLE arXiv:2303.15247) with a six-panel interface that (i) lets users compose image + text queries, (ii) projects the top-k results into a low-dimensional space using Uniform Manifold Approximation and Projection (UMAP) for spatial reasoning, (iii) overlays similarity-based saliency maps and gradient-derived token-attribution bars for local explanation, and (iv) employs an LLM-powered prompt enhancer that generates counterfactual variants and visualizes how these changes affect the ranking of user-selected target images. A modular architecture built on Plotly-Dash allows new models, datasets, and attribution methods to be plugged in with minimal effort. We argue that InfoCIR helps diagnose retrieval failures, guides prompt enhancement, and accelerates insight generation during model development. All source code allowing for a reproducible demo is available at https://github.com/giannhskp/InfoCIR.
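
To make the retrieval and ranking-comparison idea concrete, here is a minimal sketch of how a composed query can be scored against a gallery and how a target image's rank can be compared across two prompt wordings. It is not InfoCIR's or SEARLE's implementation. The fusion rule (sum of normalized image and text embeddings) is a naive stand-in, the random vectors stand in for CLIP embeddings, and every function name (`compose_query`, `rank_gallery`, `target_rank`) is hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def compose_query(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Naive late fusion of reference image and prompt (assumption, not SEARLE)."""
    return l2_normalize(l2_normalize(image_emb) + l2_normalize(text_emb))


def rank_gallery(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by descending cosine similarity to the query."""
    scores = l2_normalize(gallery) @ query
    return np.argsort(-scores, kind="stable")


def target_rank(order: np.ndarray, target_idx: int) -> int:
    """1-based position of target_idx within a ranking."""
    return int(np.where(order == target_idx)[0][0]) + 1


# Random vectors stand in for embeddings from a joint image-text space.
rng = np.random.default_rng(0)
dim, n_images = 64, 200
gallery = rng.normal(size=(n_images, dim))
reference_emb = rng.normal(size=dim)
prompt_emb = rng.normal(size=dim)
# A perturbed prompt embedding stands in for a reworded (counterfactual) prompt.
variant_emb = prompt_emb + 0.5 * rng.normal(size=dim)

target_idx = 17  # hypothetical user-selected target image
order_before = rank_gallery(compose_query(reference_emb, prompt_emb), gallery)
order_after = rank_gallery(compose_query(reference_emb, variant_emb), gallery)
rank_before = target_rank(order_before, target_idx)
rank_after = target_rank(order_after, target_idx)

top_k = order_before[:10]  # the top-k subset a dashboard would display
print(f"target rank: {rank_before} -> {rank_after} (delta {rank_before - rank_after:+d})")
```

A positive delta means the reworded prompt moved the target closer to the top of the ranking. A real system would replace the random vectors with model embeddings and the naive fusion with the back-end's own query composition.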
