Relevance Feedback in Text-to-Image Diffusion: A Training-Free and Model-Agnostic Interactive Framework

Text-to-image generation using diffusion models has achieved remarkable success. However, users often have a clear visual intent but struggle to express it precisely in language, resulting in ambiguous prompts and misaligned images. Existing methods struggle to bridge this gap, typically relying on cognitively demanding textual dialogues, opaque black-box inference, or expensive fine-tuning. None of them simultaneously achieves low cognitive load and interpretable preference inference while remaining training-free and model-agnostic. To address this, we propose RFD, an interactive framework that adapts the relevance feedback mechanism from information retrieval to diffusion models. In RFD, users give implicit, multi-select visual feedback instead of explicit textual dialogue, which minimizes cognitive load and lets them easily express complex, multi-dimensional preferences. To translate feedback into precise generative guidance, we construct an expert-curated feature repository and introduce an information-theoretic weighted cumulative preference analysis. This white-box method computes preferences from the current round of feedback and accumulates them incrementally, which avoids concatenating historical interactions and prevents the inference degradation caused by lengthy contexts. Furthermore, RFD employs a probabilistic sampling mechanism for prompt reconstruction to balance exploitation and exploration, preventing output homogenization. Crucially, RFD operates entirely within the external text space, making it strictly training-free and model-agnostic, and therefore a universal plug-and-play solution. Extensive experiments demonstrate that RFD effectively captures the user's true visual intent, significantly outperforming baselines in preference alignment.
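
The feedback loop described above can be illustrated with a minimal Python sketch. The abstract does not give RFD's formulas, so everything here is an assumption made for illustration: the entropy-reduction weight, the additive accumulation, the softmax sampling with a `temperature` parameter, and the hypothetical names `round_scores`, `accumulate`, and `sample_prompt`. Each displayed image is assumed to be tagged with one value per feature dimension drawn from the feature repository.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def round_scores(shown, selected):
    """Score feature values from one round of multi-select feedback.

    shown:    list of dicts {dimension: value}, one per displayed image.
    selected: set of indices of the images the user picked.
    Assumed weighting: a dimension counts more when the picked images
    agree on it more than the displayed images did.
    """
    scores = {}
    if not selected:
        return scores
    for dim in shown[0]:
        h_shown = entropy([img[dim] for img in shown])
        if h_shown == 0.0:
            continue  # the dimension did not vary, so this round says nothing about it
        picked = [shown[i][dim] for i in sorted(selected)]
        weight = max(0.0, 1.0 - entropy(picked) / h_shown)
        for value, count in Counter(picked).items():
            scores[(dim, value)] = weight * count / len(picked)
    return scores


def accumulate(cumulative, scores):
    """Fold the current round into the running totals; no history is stored."""
    for key, score in scores.items():
        cumulative[key] += score


def sample_prompt(base_prompt, cumulative, temperature=0.5, rng=random):
    """Rebuild the prompt by sampling one value per dimension.

    Softmax sampling favors high-scoring values (exploitation) while
    leaving some probability for the others (exploration).
    """
    by_dim = defaultdict(list)
    for (dim, value), score in cumulative.items():
        by_dim[dim].append((value, score))
    parts = [base_prompt]
    for dim in sorted(by_dim):
        values, scores = zip(*by_dim[dim])
        weights = [math.exp(s / temperature) for s in scores]
        parts.append(rng.choices(values, weights=weights, k=1)[0])
    return ", ".join(parts)
```

In this sketch only the running totals in `cumulative` carry over between rounds, which mirrors the idea of accumulating preferences instead of concatenating past interactions. The reconstructed prompt is plain text, so it could be passed to any text-to-image model without touching its weights.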
