Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, owing to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is high correspondence between the conditioning input and the generated output. Most existing methods learn this relationship implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route: we explicitly strengthen input-output connections by maximizing their mutual information. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms that incorporate it effectively into the denoising process. By connecting the CDCD loss with the conventional variational objectives, we combine diffusion training and contrastive learning for the first time. We demonstrate the efficacy of our approach on diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. On each task, we enhance the input-output correspondence and achieve higher or competitive general synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks, which significantly increases inference speed.
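To make the link between contrastive learning and mutual information concrete, the sketch below computes a generic InfoNCE-style loss and the lower bound it implies. It is not the CDCD loss itself, whose exact form the abstract does not give. The function names and the example scores are hypothetical, and the bound holds for the expected loss over many samples, not for any single one.

```python
import math


def info_nce_loss(scores):
    """Generic InfoNCE loss for one conditioning input (illustrative only).

    scores[0] is the similarity score of the matched (positive) output;
    scores[1:] are the scores of mismatched (negative) outputs.
    """
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[0]  # equals -log softmax(scores)[0]


def mi_lower_bound(expected_loss, num_candidates):
    """InfoNCE bound in nats: I(input; output) >= log(N) - expected loss."""
    return math.log(num_candidates) - expected_loss


# Hypothetical scores: one positive followed by three negatives.
aligned = info_nce_loss([4.0, 0.5, 0.2, -1.0])    # positive clearly preferred
unaligned = info_nce_loss([0.0, 0.0, 0.0, 0.0])   # positive indistinguishable

print(mi_lower_bound(aligned, 4), mi_lower_bound(unaligned, 4))
```

A lower loss yields a larger bound, so minimizing a contrastive loss of this kind pushes up a lower bound on the mutual information between the conditioning input and the generated output.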