We propose EMAGE, a framework for generating full-body human gestures from audio and masked gestures, encompassing facial, local body, hand, and global movements. To this end, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines MoShed SMPL-X body parameters with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion-captured dataset. EMAGE leverages masked body-gesture priors during training to boost inference performance. Its core is a Masked Audio Gesture Transformer, which enables joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body-gesture hints. The body hints encoded from masked gestures are then employed separately to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content, and uses four compositional VQ-VAEs to enhance the fidelity and diversity of the results. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and flexibly accepts predefined spatial-temporal gesture inputs, producing complete, audio-synchronized results. Our code and dataset are available at https://pantomatrix.github.io/EMAGE/
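To make the compositional VQ-VAE idea concrete, the sketch below illustrates the standard vector-quantization step with a separate codebook per body part (face, upper body, hands, global motion). All sizes and names here are illustrative assumptions, not the paper's actual hyperparameters: each continuous latent frame is replaced by its nearest codebook entry, yielding a discrete token sequence per part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one codebook per body part, mirroring the idea of
# compositional VQ-VAEs. Sizes are illustrative, not from the paper.
parts = ["face", "upper_body", "hands", "global_motion"]
codebook_size, latent_dim = 256, 64
codebooks = {p: rng.normal(size=(codebook_size, latent_dim)) for p in parts}

def quantize(latent, codebook):
    """Replace each latent frame with its nearest codebook entry (L2)."""
    # latent: (frames, latent_dim); codebook: (codebook_size, latent_dim)
    dists = np.linalg.norm(latent[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)  # one discrete token per frame
    return codebook[idx], idx

# Quantize a 30-frame latent sequence for each part independently.
latents = {p: rng.normal(size=(30, latent_dim)) for p in parts}
tokens = {p: quantize(latents[p], codebooks[p])[1] for p in parts}
print({p: tokens[p].shape for p in parts})
```

Quantizing each part with its own codebook lets the decoder compose part-specific discrete motion vocabularies, which is one plausible reading of how separate VQ-VAEs can increase both fidelity and diversity.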