Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
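The abstract does not state the exact form of the contrastive alignment loss. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style objective over a batch of paired latents, which is one common way to pull paired samples together and push unpaired ones apart in a shared latent space. The function name `contrastive_alignment_loss`, the `temperature` value, and the toy latents are all hypothetical and are not taken from the paper.

```python
import math

def _normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_alignment_loss(z_src, z_tgt, temperature=0.1):
    # Assumed symmetric InfoNCE: z_src[i] and z_tgt[i] are latents of a paired
    # sample; every other pairing in the batch is treated as a negative.
    a = [_normalize(v) for v in z_src]
    b = [_normalize(v) for v in z_tgt]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the source-to-target and target-to-source directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

# Toy latents: in `aligned` the pairs point the same way, in `swapped` they do not.
aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
print(aligned < swapped)
```

Under this assumption the loss is small when each source latent is most similar to its own paired target latent and large when the pairing is broken, which is the behavior the abstract attributes to the alignment term.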
