In this paper, we introduce MIDGET, a MusIc conditioned 3D Dance GEneraTion model built on a Dance Motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and a Motion Generative Pre-Training (GPT) model, to generate vibrant, high-quality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook, based on the Motion VQ-VAE model, that stores different human pose codes; 2) a Motion GPT model that generates pose codes using music and motion encoders; and 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance in motion quality and its alignment with the music.
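To make the idea of a memory codebook of pose codes concrete, the following is a minimal sketch of the nearest-neighbour lookup commonly used in VQ-VAE models. It is not the MIDGET implementation: the abstract does not specify the quantisation rule, so the squared Euclidean distance, the function names (`quantize`, `lookup`), and the toy two-dimensional codebook are all assumptions made for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: nearest-neighbour assignment, as in a standard VQ-VAE.
    """
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def lookup(codes: Sequence[int],
           codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Recover the quantised vectors from a sequence of discrete codes."""
    return [list(codebook[k]) for k in codes]


# Hypothetical toy codebook with three "pose codes" in a 2-D latent space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical encoder outputs for three motion frames.
features = [[0.9, 0.1], [0.1, 0.8], [0.2, -0.1]]

codes = quantize(features, codebook)      # discrete pose codes per frame
quantized = lookup(codes, codebook)       # vectors a decoder would consume
print(codes)
```

In a pipeline of the kind the abstract describes, a sequence model such as the Motion GPT would predict discrete codes like these, and a decoder would turn the looked-up vectors back into poses; both stages are omitted here.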