In this paper, we introduce DiffuseStyleGesture+, our solution for the Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023, which aims to foster the development of realistic, automated systems for generating conversational gestures. Participants are provided with a pre-processed dataset, and their systems are evaluated through crowdsourced scoring. Our proposed model, DiffuseStyleGesture+, leverages a diffusion model to generate gestures automatically. It incorporates a variety of modalities, including audio, text, speaker ID, and seed gestures. These modalities are mapped to a hidden space and processed by a modified diffusion model to produce the corresponding gesture for a given speech input. In the evaluation, DiffuseStyleGesture+ performed on par with the top-tier models in the challenge: it showed no significant differences from those models in human-likeness or appropriateness for the interlocutor, and it achieved competitive performance with the best model on appropriateness for agent speech. These results indicate that our model is competitive and effective in generating realistic and appropriate gestures for a given speech input. The code, pre-trained models, and demos are available at https://github.com/YoungSeng/DiffuseStyleGesture/tree/DiffuseStyleGesturePlus/BEAT-TWH-main.
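To make the phrase "mapped to a hidden space" concrete, the following is a minimal sketch, not the released implementation. It assumes each modality (audio, text, seed gesture) arrives as a fixed-length feature vector, that each gets its own linear projection, that the speaker ID indexes a learned embedding table, and that the projected vectors are combined by summation. The names `HIDDEN_DIM`, `make_linear`, `project`, and `fuse_conditions`, the feature sizes, and the choice of summation are all hypothetical; the abstract does not specify how the modalities are fused.

```python
import random

HIDDEN_DIM = 8  # hypothetical hidden size, chosen only for illustration


def make_linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, features):
    """Apply a linear map to one feature vector, giving one hidden vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def fuse_conditions(audio, text, speaker_id, seed_gesture, params):
    """Map every modality into the shared hidden space and sum the results.

    Summation is an assumption made for this sketch; a real model could
    concatenate or use attention instead.
    """
    hidden = [0.0] * HIDDEN_DIM
    for name, feats in (("audio", audio), ("text", text), ("seed", seed_gesture)):
        projected = project(params[name], feats)
        hidden = [h + p for h, p in zip(hidden, projected)]
    # The speaker ID is a discrete label, so it is looked up rather than projected.
    speaker_vec = params["speaker"][speaker_id]
    return [h + s for h, s in zip(hidden, speaker_vec)]


# Hypothetical feature sizes: 4 (audio), 3 (text), 6 (seed gesture), 2 speakers.
rng = random.Random(0)
params = {
    "audio": make_linear(4, HIDDEN_DIM, rng),
    "text": make_linear(3, HIDDEN_DIM, rng),
    "seed": make_linear(6, HIDDEN_DIM, rng),
    "speaker": [[rng.gauss(0.0, 0.1) for _ in range(HIDDEN_DIM)] for _ in range(2)],
}
cond = fuse_conditions([0.1] * 4, [0.2] * 3, 1, [0.0] * 6, params)
print(len(cond))
```

In a pipeline of this shape, the resulting vector `cond` would serve as the conditioning signal passed to the diffusion model's denoiser at each step; the denoiser itself is outside the scope of this sketch.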