Speech-driven gesture generation is highly challenging because of the random jitter in human motion. In addition, the relationship between human speech and gestures is inherently asynchronous. To tackle these challenges, we introduce a novel quantization-based and phase-guided motion-matching framework. Specifically, we first present a gesture VQ-VAE module that learns a codebook summarizing meaningful gesture units. Because each code represents a unique gesture, the random jitter problem is effectively alleviated. We then use Levenshtein distance to align diverse gestures with different speech. Computed over quantized audio, the Levenshtein distance serves as a similarity metric between the speech segments that correspond to gestures; it helps match more appropriate gestures to speech and addresses the speech-gesture alignment problem well. Moreover, we introduce phase to guide the choice of the optimal gesture match based on either the semantics of the context or the rhythm of the audio. Phase determines when text-based or speech-based gestures should be performed, making the generated gestures more natural. Extensive experiments show that our method outperforms recent approaches to speech-driven gesture generation. Our code, database, pre-trained models, and demos are available at https://github.com/YoungSeng/QPGesture.
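To make the matching step concrete, the following is a minimal sketch of Levenshtein distance used as a similarity metric over quantized audio. It assumes each speech segment has already been quantized into a sequence of integer code indices. The function names `levenshtein` and `best_match` and the toy code sequences are hypothetical and chosen only for illustration; they are not taken from the released QPGesture code, and the full framework combines this distance with other cues such as phase.

```python
from typing import List, Sequence


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Edit distance between two sequences of discrete code indices."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def best_match(query: Sequence[int], candidates: List[Sequence[int]]) -> int:
    """Index of the candidate whose audio codes are closest to the query."""
    distances = [levenshtein(query, c) for c in candidates]
    return min(range(len(candidates)), key=distances.__getitem__)


# Toy data (assumed): audio code sequences for a query segment and for
# three database segments, each of which would be paired with a gesture clip.
query = [3, 3, 7, 1]
candidates = [[3, 7, 7, 2], [5, 5, 5, 5], [3, 3, 7, 1, 2]]
print(best_match(query, candidates))
```

Because edit distance tolerates insertions and deletions, two segments whose codes are shifted or differ in length by a few frames can still score as similar, which is one way to read the claim that it suits the asynchronous relationship between speech and gestures.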