Diffusion-Augmented Interactive Text-to-Image Retrieval (DAI-TIR) is a promising paradigm that improves retrieval performance by generating query images with diffusion models and using them as additional ``views'' of the user's intent. These generative views can be unreliable, however, because diffusion generation may introduce hallucinated visual cues that conflict with the original query text; we empirically demonstrate that such hallucinated cues can substantially degrade DAI-TIR performance. To address this, we propose Diffusion-aware Multi-view Contrastive Learning (DMCL), a hallucination-robust training framework that casts DAI-TIR as joint optimization over representations of the query intent and the target image. DMCL introduces semantic-consistency and diffusion-aware contrastive objectives that align the textual and diffusion-generated query views while suppressing hallucinated query signals. The resulting encoder acts as a semantic filter that effectively maps hallucinated cues into a null space, improving robustness to spurious cues and better representing the user's intent; attention visualizations and geometric embedding-space analyses corroborate this filtering behavior. Across five standard benchmarks, DMCL delivers consistent multi-round Hits@10 improvements of up to 7.37\% over prior fine-tuned and zero-shot baselines, indicating that it is a general and robust training framework for DAI-TIR.
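To make the multi-view contrastive formulation concrete, the following is a minimal NumPy sketch of a DMCL-style objective. It assumes InfoNCE-style losses and a weighted combination of a retrieval term per query view plus a semantic-consistency term between views; the function names (`info_nce`, `dmcl_loss`) and the weights `alpha`, `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def info_nce(q, k, tau=0.07):
    """InfoNCE loss: row i of q should match row i of k (illustrative sketch)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)   # L2-normalize queries
    k = k / np.linalg.norm(k, axis=1, keepdims=True)   # L2-normalize keys
    logits = q @ k.T / tau                             # (B, B) cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # cross-entropy with diagonal positives

def dmcl_loss(text_q, diff_q, target_img, alpha=1.0, beta=0.5):
    """Hypothetical combined objective: align both query views with the target
    image (retrieval terms) and with each other (semantic-consistency term)."""
    l_text = info_nce(text_q, target_img)    # textual query view -> target image
    l_diff = info_nce(diff_q, target_img)    # diffusion-generated view -> target image
    l_consist = info_nce(text_q, diff_q)     # consistency between the two query views
    return l_text + alpha * l_diff + beta * l_consist
```

The consistency term is what discourages the encoder from propagating hallucinated cues: a diffusion-view embedding that drifts from the textual intent is penalized even when it happens to sit near some image in the batch.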