The generation of co-speech gestures for digital humans is an emerging area in the field of virtual human creation. Prior research has made progress by taking acoustic and semantic information as input and adopting classification methods to identify the speaker's identity and emotion for driving co-speech gesture generation. However, this endeavour still faces significant challenges. These challenges go beyond the intricate interplay among co-speech gestures, speech acoustics, and semantics; they also encompass the complexities associated with personality, emotion, and other obscure but important factors. This paper introduces "diffmotion-v2," a speech-conditional, diffusion-based, non-autoregressive transformer-based generative model built on the pre-trained WavLM model. It can produce individualized and stylized full-body co-speech gestures using only raw speech audio, eliminating the need for complex multimodal processing and manual annotation. First, considering that speech audio not only contains acoustic and semantic features but also conveys personality traits, emotions, and more subtle information related to accompanying gestures, we pioneer the adaptation of WavLM, a large-scale pre-trained model, to extract low-level and high-level audio information. Second, we introduce an adaptive layer norm architecture in the transformer-based layer to learn the relationship between speech information and accompanying gestures. Extensive subjective evaluation experiments are conducted on the Trinity, ZEGGS, and BEAT datasets to confirm the ability of WavLM and the proposed model to synthesize natural co-speech gestures with various styles.
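
To make the adaptive layer norm idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the common formulation in which a feature vector is first normalized and then scaled and shifted by values predicted from a conditioning vector. The class name `AdaptiveLayerNorm`, the helper functions, the dimensions, and the random weights are all hypothetical; in the model described above the conditioning would presumably come from WavLM speech features, whereas here it is just a small list of numbers.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, c):
    """Affine map; `weights` holds one row per output feature."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]


class AdaptiveLayerNorm:
    """Hypothetical adaptive layer norm: scale and shift are predicted from a condition."""

    def __init__(self, feat_dim, cond_dim, seed=0):
        rng = random.Random(seed)

        def init():
            # Small random weights stand in for trained parameters (assumption).
            return [[rng.gauss(0.0, 0.02) for _ in range(cond_dim)] for _ in range(feat_dim)]

        self.w_scale, self.b_scale = init(), [0.0] * feat_dim
        self.w_shift, self.b_shift = init(), [0.0] * feat_dim

    def __call__(self, x, c):
        scale = linear(self.w_scale, self.b_scale, c)  # per-feature scale offset
        shift = linear(self.w_shift, self.b_shift, c)  # per-feature shift
        normed = layer_norm(x)
        # The (1 + scale) form reduces to plain layer norm when scale and shift are zero.
        return [(1.0 + s) * h + t for h, s, t in zip(normed, scale, shift)]


# Illustrative values only: a 4-dim gesture feature and a 3-dim speech condition.
gesture_feat = [0.5, -1.0, 2.0, 0.0]
speech_cond = [0.1, 0.3, -0.2]
adaln = AdaptiveLayerNorm(feat_dim=4, cond_dim=3)
modulated = adaln(gesture_feat, speech_cond)
unconditioned = adaln(gesture_feat, [0.0, 0.0, 0.0])
```

With an all-zero condition the layer falls back to ordinary layer normalization, so any departure from that baseline is attributable to the conditioning signal. This is one reason the formulation is a natural fit for injecting speech information into a transformer layer.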