Recent advancements in Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture for co-speech gesture generation remains relatively unexplored, as prior methods have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple Transformer layers. To bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which performs the denoising process directly on gesture sequences. To enhance the contextual reasoning capability of temporally aligned speech-driven gestures, the model employs a mask modeling scheme specifically designed to strengthen the learning of temporal relations among gestures in a sequence, thereby expediting the learning process and yielding coherent and realistic motions. Apart from audio, our MDT-A2G model also integrates multi-modal information, encompassing text, emotion, and identity. Furthermore, we propose an efficient inference strategy that reduces the denoising computation by reusing previously calculated results, thereby achieving a speedup with negligible performance degradation. Experimental results demonstrate that MDT-A2G excels in gesture generation, with a learning speed that is over 6$\times$ faster than traditional diffusion transformers and an inference speed that is 5.7$\times$ faster than the standard diffusion model.
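The idea of reducing denoising computation by reusing previously calculated results can be illustrated with a minimal sketch. It is not the MDT-A2G implementation: `toy_denoiser`, `sample_with_reuse`, the `reuse_every` parameter, the update rule, and the tensor sizes are all hypothetical stand-ins, and the sketch assumes the simplest possible form of reuse, namely caching the last network output and skipping the network call on intermediate steps.

```python
import numpy as np


def toy_denoiser(x, t):
    """Hypothetical stand-in for the gesture denoising network."""
    return 0.1 * x * (t + 1)


def sample_with_reuse(x, num_steps, reuse_every, denoiser=toy_denoiser, step_size=0.05):
    """Run num_steps updates, calling the denoiser only every `reuse_every` steps.

    On the remaining steps the cached prediction from the most recent call is
    reused (an assumed caching scheme, chosen only for illustration).
    """
    cached = None
    calls = 0
    for i, t in enumerate(reversed(range(num_steps))):
        if cached is None or i % reuse_every == 0:
            cached = denoiser(x, t)  # expensive network evaluation
            calls += 1
        x = x - step_size * cached   # simplified update, not a real sampler
    return x, calls


rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))  # 8 frames x 4 pose dimensions (illustrative)

full, full_calls = sample_with_reuse(x0, num_steps=20, reuse_every=1)
fast, fast_calls = sample_with_reuse(x0, num_steps=20, reuse_every=4)
print(full_calls, fast_calls)
```

With `reuse_every=4`, the network is evaluated on only a quarter of the steps. How much output quality is preserved depends on the actual model and sampler, which this toy example does not attempt to reproduce.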