Motion-to-music and music-to-motion have been studied separately, each attracting substantial research interest in its respective domain. The interaction between human motion and music reflects advanced human intelligence, and establishing a unified relationship between them is particularly important. However, to date, no work has considered them jointly to explore the alignment between the two modalities. To bridge this gap, we propose a novel framework, termed MoMu-Diffusion, for long-term and synchronous motion-music generation. First, to mitigate the substantial computational cost incurred by long sequences, we propose a Bidirectional Contrastive Rhythmic Variational Auto-Encoder (BiCoR-VAE) that extracts modality-aligned latent representations for both motion and music inputs. Then, leveraging the aligned latent spaces, we introduce a multi-modal Transformer-based diffusion model and a cross-guidance sampling strategy to enable various generation tasks, including cross-modal, multi-modal, and variable-length generation. Extensive experiments demonstrate that MoMu-Diffusion surpasses recent state-of-the-art methods both qualitatively and quantitatively, and can synthesize realistic, diverse, long-term, and beat-matched music or motion sequences. Generated samples and code are available at https://momu-diffusion.github.io/.
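To make the described pipeline concrete, the following is a minimal, hypothetical sketch of its overall shape: two encoders map motion and music sequences into shorter, shared-dimensional latent sequences (standing in for the BiCoR-VAE), a Transformer denoiser predicts noise on the target modality conditioned on the source modality, and a guidance step blends conditional and unconditional estimates in the spirit of cross-guidance sampling. All module names, dimensions, and the specific guidance rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the cross-modal generation pipeline (illustrative only).
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Toy stand-in for one branch of the BiCoR-VAE: maps a raw sequence
    (motion keypoints or music spectrogram frames) to a shorter latent sequence."""
    def __init__(self, in_dim: int, latent_dim: int = 64, stride: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, latent_dim, kernel_size=stride, stride=stride)

    def forward(self, x):                    # x: (batch, time, in_dim)
        z = self.conv(x.transpose(1, 2))     # temporal downsampling
        return z.transpose(1, 2)             # (batch, time // stride, latent_dim)

class DenoisingTransformer(nn.Module):
    """Toy multi-modal denoiser: predicts the noise on the target-modality
    latents, conditioned on the source-modality latents via concatenation."""
    def __init__(self, latent_dim: int = 64, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_target, condition):
        h = self.backbone(torch.cat([condition, noisy_target], dim=1))
        return self.out(h[:, condition.size(1):])   # keep only target positions

def cross_guided_noise(model, noisy_target, condition, guidance_scale=2.0):
    """One assumed guidance step: blend conditional and unconditional noise
    estimates, analogous to classifier-free guidance."""
    eps_cond = model(noisy_target, condition)
    eps_uncond = model(noisy_target, torch.zeros_like(condition))
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

if __name__ == "__main__":
    motion = torch.randn(1, 240, 51)    # e.g. 51-D keypoints over 240 frames
    enc_motion = LatentEncoder(51)
    z_motion = enc_motion(motion)       # aligned motion latents
    denoiser = DenoisingTransformer()
    noisy = torch.randn_like(z_motion)  # start from noise for motion-to-music
    eps = cross_guided_noise(denoiser, noisy, z_motion)
    print(eps.shape)                    # (1, 60, 64)
```

In this sketch, generation in the reverse direction (music-to-motion) would simply swap which modality serves as the condition; the paper's actual models, training objectives, and sampling schedule are described in the following sections.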