Zero-Shot Composed Image Retrieval (ZS-CIR) spans diverse tasks with a broad range of visual content manipulation intents across domains, scenes, objects, and attributes. The key challenge in ZS-CIR is to modify a reference image according to manipulation text so as to accurately retrieve a target image, especially when the reference image lacks essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, which adaptively predicts the missing target visual content of reference images in the latent space before mapping, enabling accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content from a target view, coupled with an action that encodes the manipulation intent, derived from existing image-caption pairs. A target content prediction module then trains a world model as a predictor to adaptively predict the missing visual information in the latent space, guided by the user's manipulation intent expressed in the text. The two modules map an image, together with the predicted relevant information, to a pseudo-word token without extra supervision. Our model shows strong generalization across six ZS-CIR tasks, achieving consistent and significant performance gains of 1.73% to 4.45% over the best prior methods and setting new state-of-the-art results on ZS-CIR. Our code is available at https://github.com/Pter61/predicir.
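The two-stage idea above (omit content from a target view to form a source view, then predict the missing latent content from the manipulation intent before mapping to a pseudo-word token) can be illustrated with a minimal toy sketch. This is not the paper's implementation: the latent dimension, masking scheme, identity-style "predictor", and linear projection below are all placeholder assumptions standing in for trained networks operating in a vision-language latent space.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy latent dimension (assumption; the real model works in a VL latent space)

# --- World view generation (toy): build a source view by omitting part of
# the target view's visual content, paired with an "action" embedding that
# carries the manipulation intent for restoring the omitted content.
target_view = rng.normal(size=D)
mask = np.zeros(D)
mask[D // 2:] = 1.0                        # mark second half as "missing" content
source_view = target_view * (1.0 - mask)   # omitted content zeroed out
action = target_view * mask                # toy stand-in for the intent embedding

# --- Target content prediction (toy): a "world model" predictor that fills
# in the missing latent content from the source view and the action.
def predict_missing(source, action, mask):
    # The paper trains a network for this step; here a simple additive
    # fill-in illustrates the interface only.
    return source + action * mask

predicted = predict_missing(source_view, action, mask)

# --- Mapping: project the completed latent onto a single pseudo-word token
# (toy random projection; the real mapping network is learned).
W = rng.normal(size=(D, D)) / np.sqrt(D)
pseudo_token = predicted @ W
```

In this toy setup the predictor exactly recovers the omitted content, so `predicted` matches `target_view`; in the actual method, the predictor must infer plausible missing content from the manipulation text rather than copy it.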