Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model

The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by taking acoustic and semantic information as input and adopting classification methods to identify the speaker's identity and emotion, which then drive co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay among co-speech gestures, speech acoustics, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2," a speech-conditional, diffusion-based, non-autoregressive transformer-based generative model built on the WavLM pre-trained model. It can produce individual and stylized full-body co-speech gestures using only raw speech audio, eliminating the need for complex multimodal processing and manual annotation. Firstly, considering that speech audio not only contains acoustic and semantic features but also conveys personality traits, emotions, and more subtle information related to accompanying gestures, we pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract low-level and high-level audio information. Secondly, we introduce an adaptive layer norm architecture in the transformer-based layer to learn the relationship between speech information and accompanying gestures. Extensive subjective evaluation experiments are conducted on the Trinity, ZEGGS, and BEAT datasets to confirm the ability of WavLM and the model to synthesize natural co-speech gestures with various styles.
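
As a rough illustration of the adaptive layer norm idea mentioned in the abstract, the sketch below normalizes a gesture feature vector and then modulates it with a scale and shift predicted from a conditioning vector (for example, a speech embedding). The abstract does not give the exact formulation, so the common scale-and-shift form is assumed here, and all names (`adaptive_layer_norm`, `w_scale`, `w_shift`, and so on) are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(w: Sequence[Sequence[float]], b: Sequence[float],
           c: Sequence[float]) -> List[float]:
    """Compute W c + b, with the weight matrix W given as a list of rows."""
    return [sum(wi * ci for wi, ci in zip(row, c)) + bi for row, bi in zip(w, b)]


def adaptive_layer_norm(x: Sequence[float], cond: Sequence[float],
                        w_scale: Sequence[Sequence[float]], b_scale: Sequence[float],
                        w_shift: Sequence[Sequence[float]], b_shift: Sequence[float]
                        ) -> List[float]:
    """Assumed form: (1 + scale(cond)) * LayerNorm(x) + shift(cond).

    `cond` stands in for a speech-derived embedding; the two linear maps are
    hypothetical learned projections from the condition to per-feature
    scale and shift values.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(x)
    return [(1.0 + s) * n + t for s, n, t in zip(scale, normed, shift)]
```

With all projection weights and biases set to zero, the function reduces to a plain layer norm, which is a convenient sanity check: the conditioning signal only changes the output once the projections are non-zero.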
