Diffusion models for Text-to-Image (T2I) conditional generation have seen tremendous success recently. Despite this success, accurately capturing user intent with these models still requires a laborious trial-and-error process. This challenge is commonly identified as a model alignment problem, an issue that has attracted considerable attention from the research community. Instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-language models to steer image generation, in this work we present a novel method that relies on an information-theoretic alignment measure. In a nutshell, our method uses self-supervised fine-tuning and relies on the point-wise mutual information between prompts and images to define a synthetic training set that induces model alignment. Our comparative analysis shows that our method is on par with or superior to the state-of-the-art, yet requires nothing but a pre-trained denoising network to estimate MI and a lightweight fine-tuning strategy.
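The core selection idea (rank generated prompt–image pairs by point-wise mutual information and keep the highest-scoring ones as a synthetic fine-tuning set) can be illustrated on a toy discrete joint distribution. This is a minimal sketch with made-up probabilities, not the paper's denoising-based MI estimator; the `joint` table, `pmi` helper, and `top_k` cutoff are all hypothetical:

```python
import math

# Toy joint distribution p(prompt, image) over 3 prompts x 3 images
# (hypothetical numbers; rows = prompts, columns = images).
joint = [
    [0.20, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.30],
]

def pmi(joint, i, j):
    """Point-wise mutual information: log p(i,j) - log p(i) - log p(j)."""
    p_ij = joint[i][j]
    p_i = sum(joint[i])               # marginal over images
    p_j = sum(row[j] for row in joint)  # marginal over prompts
    return math.log(p_ij / (p_i * p_j))

# Rank all (prompt, image) pairs by PMI and keep the top-k as the
# synthetic training set for fine-tuning.
pairs = sorted(((pmi(joint, i, j), i, j)
                for i in range(3) for j in range(3)), reverse=True)
top_k = pairs[:3]  # here the well-matched diagonal pairs are selected
```

In this toy setup the diagonal pairs (where the image is most probable given its prompt) receive positive PMI and are retained, while mismatched pairs score negatively and are discarded, which mirrors the role PMI plays in filtering a self-supervised training set.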