Sounding Video Generation (SVG) is a joint audio-video generation task made challenging by high-dimensional signal spaces, heterogeneous data formats, and differing patterns of content information across modalities. To address these issues, we introduce a novel multi-modal latent diffusion model (MM-LDM) for the SVG task. We first unify the representation of audio and video data by converting each into a single image or a pair of images. We then introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The former is perceptually equivalent to each modality's raw signal space but drastically reduces signal dimensionality; the latter bridges the information gap between modalities and provides more insightful cross-modal guidance. Our proposed method achieves new state-of-the-art results with significant gains in quality and efficiency. Specifically, it improves on all evaluation metrics while training and sampling faster on the Landscape and AIST++ datasets. Moreover, we evaluate it on open-domain sounding video generation, long sounding video generation, audio continuation, video continuation, and conditional single-modal generation tasks, where MM-LDM demonstrates strong adaptability and generalization ability.
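The two-level design described above can be illustrated with a minimal, shape-only sketch. This is not the paper's implementation: all dimensions, names, and the random linear "encoders" are hypothetical, chosen only to show how each modality (here, audio rendered as a spectrogram image so it shares the video's format) is first compressed into its own low-level perceptual latent and then projected into a shared semantic space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw signals: a video frame and an audio spectrogram,
# both represented as images so the two modalities share one format.
video = rng.standard_normal((3, 64, 64))  # 3 x 64 x 64 = 12288 values
audio = rng.standard_normal((1, 64, 64))  # spectrogram as a 1-channel image

def encode_perceptual(x, latent_dim=256):
    """Toy per-modality 'perceptual' encoder: flatten and linearly project
    to a much lower-dimensional latent (stand-in for a learned encoder)."""
    flat = x.reshape(-1)
    w = rng.standard_normal((latent_dim, flat.size)) / np.sqrt(flat.size)
    return w @ flat

def to_semantic(z, sem_dim=64):
    """Toy projection from a perceptual latent into a shared semantic
    space, where cross-modal guidance would be computed."""
    w = rng.standard_normal((sem_dim, z.size)) / np.sqrt(z.size)
    return w @ z

# Low-level perceptual latents: one per modality, far smaller than raw signals.
z_video = encode_perceptual(video)
z_audio = encode_perceptual(audio)

# High-level semantic features: both modalities mapped into one shared space.
s_video = to_semantic(z_video)
s_audio = to_semantic(z_audio)

print(video.size, z_video.size, s_video.size)  # e.g. 12288 -> 256 -> 64
```

The point of the sketch is the compression hierarchy: diffusion operates in the small perceptual latents rather than the raw signal space, while the even smaller shared semantic space is where the two modalities can exchange guidance.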