Vision-Language Latent Diffusion Models (LDMs) (Rombach et al., 2022) provide powerful generative priors for inverse problems. However, existing LDM-based inverse solvers typically require a large number of neural function evaluations (NFEs) and backpropagation through large pretrained components, leading to substantial computational costs and, in some cases, degraded reconstruction quality. We propose a unified Euclidean-Wasserstein-2 gradient-flow framework that jointly performs posterior sampling and prompt optimization in the latent space through a single flow that aligns the prior and posterior with the observed data. Combined with few-step latent text-to-image models, this formulation enables low-NFE inference without backpropagation through autoencoders. Experiments across several canonical imaging inverse problems show that our method achieves state-of-the-art performance with significantly reduced computational cost.