We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hand, and global movements. To achieve this, we first introduce BEATX (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEATX combines the MoShed SMPLX body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion-capture dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It uses a Masked Audio Gesture Transformer, which is trained jointly on audio-to-gesture generation and masked gesture reconstruction so that it effectively encodes audio and body gesture hints. The body hints encoded from masked gestures are then used separately to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and uses four compositional VQ-VAEs to enhance the fidelity and diversity of the results. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and flexibly accepts predefined spatio-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available at https://pantomatrix.github.io/EMAGE/
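As a rough illustration of what a "masked gesture" input with a predefined spatio-temporal region might look like, the toy sketch below hides a chosen set of joints over a chosen frame interval. It is not taken from the EMAGE code: the function names `mask_gestures` and `visible_ratio`, the `MASK` placeholder, and the frames-by-joints list layout are all assumptions made for illustration.

```python
# Toy sketch only: names and data layout are hypothetical, not EMAGE's API.
MASK = None  # assumed placeholder marking a hidden joint value


def mask_gestures(seq, frame_range, joint_ids):
    """Return a copy of seq with the given joints hidden inside frame_range.

    seq: list of frames, each frame a list of per-joint values.
    frame_range: (start, stop) half-open frame interval to mask.
    joint_ids: indices of the joints to hide within that interval.
    """
    start, stop = frame_range
    hidden = set(joint_ids)
    return [
        [MASK if (start <= t < stop and j in hidden) else v
         for j, v in enumerate(frame)]
        for t, frame in enumerate(seq)
    ]


def visible_ratio(masked):
    """Fraction of joint values that remain visible after masking."""
    total = sum(len(frame) for frame in masked)
    kept = sum(v is not MASK for frame in masked for v in frame)
    return kept / total


# Toy sequence: 4 frames x 3 joints, with one scalar per joint.
seq = [[float(t * 3 + j) for j in range(3)] for t in range(4)]
# Hide joints 0 and 2 during frames 1 and 2; everything else stays as a hint.
masked = mask_gestures(seq, (1, 3), [0, 2])
```

In a setup like the one the abstract describes, the visible entries would play the role of gesture hints, and a model would be asked to fill in the hidden ones in sync with the audio; here the mask only marks which entries are withheld.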