Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods such as CAVP effectively model semantic and temporal correspondences between modalities with contrastive objectives, their performance remains suboptimal. A key limitation is insufficient modeling of the dense, multi-scale nature of video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which existing frameworks underutilize. To address this limitation, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods on both generation and retrieval.
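To make the multi-scale contrastive strategy concrete, one plausible formulation is a symmetric InfoNCE loss aggregated over temporal granularities. This is a sketch only: the scale set $S$, scale weights $\lambda_s$, pooling operators $P_s$, encoders $f_v$ and $f_a$, similarity $\mathrm{sim}$, and temperature $\tau$ are our assumptions for exposition, not the paper's stated design.

\[
\mathcal{L}_{\text{ms}} = -\sum_{s \in S} \lambda_s \sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(P_s(f_v(v_i)),\, P_s(f_a(a_i)))/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(P_s(f_v(v_i)),\, P_s(f_a(a_j)))/\tau\big)}
\]

Here each $P_s$ pools features to a granularity $s$ (e.g., frame-, second-, or clip-level), so that both fine- and coarse-grained correspondences contribute to the alignment objective.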
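Likewise, a minimal PyTorch-style sketch of how the contrastive and diffusion objectives might be combined in one training step is given below. The function name gms_cavp_step, the denoiser interface denoiser(noisy, t, cond=...), the cosine noise schedule, the temperature 0.07, and the weight lam are all illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn.functional as F

def gms_cavp_step(video_feats, audio_lat, denoiser, num_steps=1000, lam=0.5):
    """One hypothetical training step combining a contrastive and a
    diffusion objective; names and weighting are illustrative only."""
    # --- Contrastive term (a single scale shown for brevity) ---
    v = F.normalize(video_feats.mean(dim=1), dim=-1)   # (B, D) pooled video
    a = F.normalize(audio_lat.mean(dim=1), dim=-1)     # (B, D) pooled audio
    logits = v @ a.t() / 0.07                          # temperature assumed
    targets = torch.arange(v.size(0), device=v.device)
    l_con = (F.cross_entropy(logits, targets) +
             F.cross_entropy(logits.t(), targets)) / 2  # symmetric InfoNCE

    # --- Diffusion term: denoise audio latents conditioned on video ---
    t = torch.randint(0, num_steps, (audio_lat.size(0),), device=audio_lat.device)
    noise = torch.randn_like(audio_lat)
    # Assumed cosine schedule for the cumulative signal fraction alpha_bar.
    alpha_bar = torch.cos(t.float() / num_steps * torch.pi / 2) ** 2
    alpha_bar = alpha_bar.view(-1, 1, 1)
    noisy = alpha_bar.sqrt() * audio_lat + (1 - alpha_bar).sqrt() * noise
    # Epsilon-prediction loss: the denoiser regresses the added noise.
    l_diff = F.mse_loss(denoiser(noisy, t, cond=video_feats), noise)

    return l_con + lam * l_diff  # lam balances the two objectives (assumed)
```

The diffusion term trains the model to recover noise added to audio latents conditioned on video features, which is what would let the joint embedding support modality translation and synthesis in addition to retrieval.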