We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length. While previous works have focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, which helps the model better match the joint expression-gesture distribution. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution for producing high-quality, synchronized expressions and gestures driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling real-time generation of expressive and synchronized motions, DiffSHEG shows potential for various applications in the development of digital humans and embodied agents.
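The abstract names an outpainting-based sampling strategy but does not spell out the procedure. The sketch below is one plausible reading, not the authors' implementation: a long sequence is produced clip by clip, and the first few frames of each new clip are pinned to the last frames of the previous clip during every reverse-diffusion step, so that consecutive clips join without a seam. All names here (`toy_denoiser`, `sample_clip`, `generate_long`), the noise schedule, the clip length, and the overlap size are assumptions made for illustration, and the denoiser is a toy stand-in for a trained network.

```python
import numpy as np


def toy_denoiser(x_t, t, audio):
    """Hypothetical stand-in for a trained network that predicts the clean clip.

    It ignores x_t and t and simply maps the audio features to a motion.
    """
    return np.tanh(audio)


def sample_clip(audio, known, alphas_bar, rng):
    """Run reverse diffusion for one clip.

    If `known` is given, the first len(known) frames are treated as already
    generated context and are re-imposed at every step (outpainting).
    """
    x = rng.standard_normal(audio.shape)
    k = 0 if known is None else len(known)
    for t in range(len(alphas_bar) - 1, -1, -1):
        ab_t = alphas_bar[t]
        if k:
            # Overwrite the overlap with the known frames noised to level t.
            noise = rng.standard_normal(known.shape)
            x[:k] = np.sqrt(ab_t) * known + np.sqrt(1.0 - ab_t) * noise
        x0 = toy_denoiser(x, t, audio)
        if t == 0:
            x = x0
        else:
            # Deterministic DDIM-style update (an assumed choice of sampler).
            eps = (x - np.sqrt(ab_t) * x0) / np.sqrt(1.0 - ab_t)
            ab_prev = alphas_bar[t - 1]
            x = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
    if k:
        x[:k] = known  # keep the shared frames exactly as generated before
    return x


def generate_long(audio, clip_len=8, overlap=2, steps=10, seed=0):
    """Generate a motion for the whole audio by chaining overlapping clips."""
    rng = np.random.default_rng(seed)
    alphas_bar = np.linspace(0.999, 0.01, steps)  # assumed noise schedule
    motion = sample_clip(audio[:clip_len], None, alphas_bar, rng)
    start = clip_len - overlap
    while start + overlap < len(audio):
        window = audio[start:start + clip_len]
        context = motion[start:start + overlap]
        clip = sample_clip(window, context, alphas_bar, rng)
        motion = np.concatenate([motion, clip[overlap:]])
        start += clip_len - overlap
    return motion


if __name__ == "__main__":
    audio = np.random.default_rng(1).standard_normal((21, 3))
    motion = generate_long(audio)
    print(motion.shape)
```

Because each clip only needs the tail of the previous one, generation can proceed as audio arrives, which is consistent with the flexibility and real-time use the abstract describes. How the actual method handles overlap length and conditioning is not stated here.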