Co-speech gesture generation is a critical research area aimed at synthesizing speech-synchronized, human-like gestures. Existing methods often suffer from rhythmic inconsistency, motion jitter, foot sliding, and limited multi-sampling diversity. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a dual-stream Diffusion Transformer (DiT) architecture to synthesize holistic gestures and enhance sampling diversity. Specifically, we (1) fuse audio and motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, and (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat consistency metric that is less sensitive to motion noise. Comprehensive experiments on the BEAT2 and SHOW datasets demonstrate SmoothSync's superiority: on BEAT2 it improves FGD by 30.6%, Smooth-BC by 10.3%, and Diversity by 8.4% over state-of-the-art methods, while reducing jitter and foot sliding by 62.9% and 17.1%, respectively. The code will be released to facilitate future research.
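The abstract does not specify the form of the jitter-suppression loss. As a minimal sketch, assuming it follows the common practice of penalizing high-order temporal differences (here mean squared jerk) of the predicted joint trajectories, one hypothetical PyTorch formulation might look like this; the paper's actual loss may differ.

```python
import torch

def jitter_loss(motion: torch.Tensor) -> torch.Tensor:
    """Hypothetical jitter-suppression loss: mean squared jerk.

    motion: tensor of shape (batch, frames, joints * dims) holding
    predicted joint trajectories over time.
    """
    vel = motion[:, 1:] - motion[:, :-1]   # first difference: velocity
    acc = vel[:, 1:] - vel[:, :-1]         # second difference: acceleration
    jerk = acc[:, 1:] - acc[:, :-1]        # third difference: jerk
    return jerk.pow(2).mean()              # penalize rapid acceleration changes
```

Penalizing jerk rather than velocity or acceleration discourages high-frequency flicker while still permitting fast, deliberate motion, which is consistent with the stated goal of improving temporal smoothness without dampening gesture dynamics.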