Gestures are pivotal in enhancing co-speech communication. While recent work has mostly focused on point-level motion transformation or fully supervised motion representations learned through data-driven approaches, we explore the representation of gestures in co-speech, focusing on self-supervised representation and pixel-level motion deviation, using a diffusion model that incorporates latent motion features. Our approach leverages self-supervised deviation in the latent representation to facilitate hand gesture generation, which is crucial for producing realistic gesture videos. Results of our first experiment demonstrate that our method improves the quality of generated videos, with gains of 2.7% to 4.5% on FGD, DIV, and FVD, 8.1% on PSNR, and 2.5% on SSIM over current state-of-the-art methods.