Diffusion models excel in noise-to-data generation tasks, providing a mapping from a Gaussian distribution to a more complex data distribution. However, they struggle to model translations between complex distributions, limiting their effectiveness in data-to-data tasks. While Bridge Matching (BM) models address this by finding the translation between data distributions, their application to time-correlated data sequences remains unexplored. This is a critical limitation for video generation and manipulation tasks, where maintaining temporal coherence is particularly important. To address this gap, we propose Time-Correlated Video Bridge Matching (TCVBM), a framework that extends BM to time-correlated data sequences in the video domain. TCVBM explicitly models inter-sequence dependencies within the diffusion bridge, directly incorporating temporal correlations into the sampling process. We compare our approach to classical methods based on bridge matching and diffusion models on three video-related tasks: frame interpolation, image-to-video generation, and video super-resolution. TCVBM achieves superior performance across multiple quantitative metrics, demonstrating enhanced generation quality and reconstruction fidelity.