We propose a novel task: generating 3D dance movements conditioned simultaneously on text and music. Unlike existing works that generate dance movements from a single modality such as music, our goal is to produce richer dance movements guided by the instructive information in the text. However, the lack of motion data paired with both music and text limits the ability to generate dance movements that integrate the two modalities. To address this, we use a 3D human motion VQ-VAE to project the motions of two datasets, one paired with music and one paired with text, into a shared latent space of quantized vectors, so that motion tokens from the two datasets, which follow different distributions, can be mixed effectively for training. Additionally, we propose a cross-modal transformer that integrates text instructions into the motion generation architecture, enabling text-guided 3D dance generation without degrading the performance of music-conditioned dance generation. To better evaluate the quality of the generated motion, we introduce two novel metrics, Motion Prediction Distance (MPD) and Freezing Score (FS), which measure the coherence and the freezing percentage of the generated motion, respectively. Extensive experiments show that our approach generates realistic and coherent dance movements conditioned on both text and music while maintaining performance comparable to that of the two single-modality settings. Code is available at https://garfield-kh.github.io/TM2D/.
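To illustrate how vector quantization can place motions from two differently distributed datasets in one shared token vocabulary, the following is a minimal NumPy sketch of the nearest-codebook lookup step only. It is not the released implementation: the function name `quantize`, the codebook size, the latent dimension, and the random stand-in features are all assumptions made for illustration, and a real VQ-VAE would learn both the encoder and the codebook.

```python
import numpy as np


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    # Squared Euclidean distance between every feature and every code:
    # result has shape (num_frames, num_codes).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)

# Hypothetical sizes: 8 codes, 4-dimensional latent vectors.
codebook = rng.normal(size=(8, 4))

# Random stand-ins for encoder outputs of motions from the two datasets
# (one paired with music, one paired with text), drawn from different distributions.
music_motion_latents = rng.normal(loc=0.0, size=(5, 4))
text_motion_latents = rng.normal(loc=0.5, size=(6, 4))

# Both sets of latents are mapped into the same discrete vocabulary.
music_tokens = quantize(music_motion_latents, codebook)
text_tokens = quantize(text_motion_latents, codebook)

# Because the tokens share one codebook, they can be mixed in a single training set.
mixed_tokens = np.concatenate([music_tokens, text_tokens])
```

Since every motion is expressed with indices into the same codebook, a downstream sequence model sees a single token vocabulary regardless of which dataset a motion came from.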