This paper addresses the problem of generating whole-body motion from speech. Despite notable progress, prior methods still struggle to produce motions that are both plausible and diverse, largely because they rely on suboptimal motion representations and lack strategies for generating diverse results. To address these challenges, we present a novel hybrid point representation that enables accurate and continuous motion generation (e.g., avoiding foot skating) and can be converted into an easy-to-use representation, the SMPL-X body mesh, for many downstream applications. For facial motion, which is closely tied to the audio signal, we introduce an encoder-decoder architecture that produces deterministic outcomes. For the body and hands, which are only weakly correlated with audio, we instead aim to generate motions that are diverse yet plausible. To boost diversity, we propose a contrastive motion learning method that encourages the model to produce more distinctive representations. Specifically, we design a robust VQ-VAE that learns a quantized motion codebook from our hybrid representation, and then regress the motion representation from the audio signal with a translation model trained using our contrastive motion learning method. Experimental results validate the superior performance and effectiveness of our model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.
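The contrastive motion learning idea above can be illustrated with a generic InfoNCE-style objective that pulls each audio feature toward its paired motion representation and pushes it away from the other motions in the batch. This is a minimal sketch under assumed conventions, not the paper's exact loss: the function name, batch layout (positives on the diagonal), and temperature value are all illustrative assumptions.

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    queries: (B, D) array, e.g. audio-derived features.
    keys:    (B, D) array, e.g. motion representations; keys[i] is the
             positive match for queries[i], all other rows are negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                    # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; minimize their negative log-likelihood.
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss makes matched pairs more similar than mismatched ones, which is the sense in which a contrastive objective yields more distinctive representations.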