Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse the multi-modal views of user feedback by simple embedding addition. In this work, we show that this static, undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, degrading performance for up to 55.62% of samples. We therefore propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that pairs an adaptive gating branch, which dynamically balances modality reliability, with a semantic-aware mixture-of-experts branch, which captures fine-grained cross-modal nuances. In a thorough evaluation on four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.
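To make the contrast between static embedding addition and a dual-branch fusion concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not specify layer shapes, the number of experts, the routing scheme, or how the two branches are combined, so everything here is an assumption made for illustration: the names `additive_fusion` and `DualBranchFusion`, the randomly initialized weights, the soft (dense) routing over four experts, and the choice to sum the branch outputs and L2-normalize the result.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def additive_fusion(text_emb, image_emb):
    """Static baseline: both views always contribute with equal weight."""
    return text_emb + image_emb


class DualBranchFusion:
    """Hypothetical dual-branch fusion (illustrative only, untrained weights)."""

    def __init__(self, dim, num_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(2 * dim)
        # Gating branch: one weight per embedding dimension.
        self.w_gate = rng.normal(0.0, scale, (2 * dim, dim))
        # Mixture-of-experts branch: a router plus one linear map per expert.
        self.w_router = rng.normal(0.0, scale, (2 * dim, num_experts))
        self.w_experts = rng.normal(0.0, scale, (num_experts, 2 * dim, dim))

    def __call__(self, text_emb, image_emb):
        joint = np.concatenate([text_emb, image_emb], axis=-1)  # (B, 2d)

        # Branch 1: input-dependent gate in (0, 1) that trades off the
        # text view against the diffusion-generated image view.
        gate = sigmoid(joint @ self.w_gate)                      # (B, d)
        gated = gate * text_emb + (1.0 - gate) * image_emb       # (B, d)

        # Branch 2: softly routed mixture of linear experts.
        route = softmax(joint @ self.w_router)                   # (B, E)
        expert_out = np.einsum("bi,eio->beo", joint, self.w_experts)  # (B, E, d)
        mixed = np.einsum("be,beo->bo", route, expert_out)       # (B, d)

        # Assumed combination: sum the branches, then L2-normalize.
        fused = gated + mixed
        return fused / np.linalg.norm(fused, axis=-1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    text = rng.normal(size=(3, 8))   # stand-in for text-query embeddings
    image = rng.normal(size=(3, 8))  # stand-in for generated-image embeddings
    fusion = DualBranchFusion(dim=8)
    print(additive_fusion(text, image).shape, fusion(text, image).shape)
```

In this sketch the fusion module only consumes embeddings that a frozen encoder would already have produced, which mirrors the abstract's claim that the fusion model can be added without modifying the backbone. In practice the weights would be learned rather than sampled at random.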