Cross-modality data translation has attracted great interest in image computing. Deep generative models (\textit{e.g.}, GANs) have improved performance on these problems. Nevertheless, a fundamental challenge in image translation, zero-shot cross-modality data translation with fidelity, remains unsolved. This paper proposes a new unsupervised zero-shot learning method, the Mutual Information guided Diffusion cross-modality data translation Model (MIDiffusion), which learns to translate unseen source data to the target domain. MIDiffusion leverages a score-matching-based generative model that learns prior knowledge of the target domain. We propose a differentiable local-wise MI layer ($LMI$) for conditioning the iterative denoising sampling. The $LMI$ captures cross-modality features that are shared in the statistical domain and uses them to guide the diffusion. Because the method does not rely on any direct mapping between the source and target domains, it requires no retraining when the source domain changes. This advantage is critical for applying cross-modality data translation in practice, since a sufficient amount of source-domain data is not always available for supervised training. We empirically show that MIDiffusion achieves superior performance compared with an influential group of generative models, including adversarial-based and other score-matching-based models.
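To make the idea of a local mutual-information statistic concrete, the following is a minimal sketch and not the $LMI$ layer itself. It assumes a plug-in histogram estimate of $I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$ computed over non-overlapping patches. This estimator is not differentiable, whereas the abstract states that the proposed layer is. The function names, patch size, and bin count are illustrative assumptions.

```python
import numpy as np

def mutual_information(a, b, bins=8):
    """Plug-in histogram estimate of MI (in nats) between two equally sized arrays."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()               # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)     # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)     # marginal p(y), shape (1, bins)
    nz = pxy > 0                            # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def local_mi_map(src, tgt, patch=16, bins=8):
    """MI per non-overlapping patch (hypothetical helper; patch and bins are assumptions)."""
    rows, cols = src.shape[0] // patch, src.shape[1] // patch
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = (slice(i * patch, (i + 1) * patch),
                      slice(j * patch, (j + 1) * patch))
            out[i, j] = mutual_information(src[window], tgt[window], bins)
    return out

# Toy check: an intensity-inverted copy stands in for "another modality".
rng = np.random.default_rng(0)
src = rng.random((64, 64))
mi_inverted = local_mi_map(src, 1.0 - src)           # same structure, different intensities
mi_noise = local_mi_map(src, rng.random((64, 64)))   # unrelated image
print(mi_inverted.mean(), mi_noise.mean())
```

In this toy setting the inverted copy has very different pixel intensities from the source yet scores a much higher local MI than the unrelated image. This illustrates why a statistical dependence measure can relate modalities without a learned direct mapping between them.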