Along with the explosion of large language models, improvements in speech synthesis, advances in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize results, we implement a full rendering pipeline in Unity based on the BVH output of the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness. Our system supports interpolation between emotional states and generalizes to out-of-distribution speech, including synthetic voices, marking a step toward fully multimodal, emotionally aware digital humans.
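The emotion-guided classifier-free diffusion and emotion interpolation mentioned above can be sketched as follows. This is a minimal illustrative example, not the DeepGesture implementation: the function names, array shapes, and the use of plain linear interpolation between emotion embeddings are assumptions for demonstration.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: blend the model's unconditional and
    condition-guided noise predictions. A scale > 1 pushes the output
    further toward the conditioning signal (e.g., the target emotion)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def interpolate_emotion(emb_a, emb_b, t):
    """Linearly interpolate between two emotion embeddings, t in [0, 1].
    Conditioning the denoiser on the blended embedding yields gestures
    between the two affective states (illustrative assumption)."""
    return (1.0 - t) * emb_a + t * emb_b

# Toy example: per-frame noise predictions for a short gesture clip
# (frames x pose dimensions); values here are placeholders.
eps_uncond = np.zeros((4, 6))          # prediction with conditioning dropped
eps_cond = np.ones((4, 6))             # prediction conditioned on emotion
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.5)

happy = np.array([1.0, 0.0])           # placeholder emotion embeddings
sad = np.array([0.0, 1.0])
mixed = interpolate_emotion(happy, sad, t=0.5)
```

During training, the conditioning is randomly dropped (replaced by a null embedding) so that a single network learns both predictions; at sampling time the two are combined as above at every denoising step.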