Current talking avatars mostly generate co-speech gestures from the audio and text of the utterance, without considering the non-speaking motion of the speaker. Furthermore, prior work on co-speech gesture generation has designed network structures around individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tackle these issues, we introduce FreeTalker, which, to the best of our knowledge, is the first framework for generating both spontaneous (e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium) speaker motions. Specifically, we train a diffusion-based model for speaker motion generation that employs a unified representation of speech-driven gestures and text-driven motions, using heterogeneous data drawn from various motion datasets. During inference, we use classifier-free guidance to control the style of the generated clips. Additionally, to create smooth transitions between clips, we use DoubleTake, a method that leverages a generative prior to blend motions seamlessly. Extensive experiments show that our method generates natural and controllable speaker movements. Our code, model, and demo are available at \url{https://youngseng.github.io/FreeTalker/}.
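As a rough illustration of the classifier-free guidance step used at inference, the sketch below shows the standard way a conditional and an unconditional denoiser prediction are combined at each sampling step. It is a minimal sketch, not the FreeTalker implementation: the function name `cfg_combine`, the array shapes, and the guidance scales are assumptions chosen for the example, and random arrays stand in for the two network outputs.

```python
import numpy as np


def cfg_combine(pred_uncond, pred_cond, scale):
    """Combine two denoiser predictions with classifier-free guidance.

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 extrapolates further toward the condition
    (for example a style label), which strengthens its influence.
    """
    return pred_uncond + scale * (pred_cond - pred_uncond)


# Hypothetical stand-ins for one denoising step on a motion clip:
# 4 frames, 8 pose features per frame (shapes are illustrative only).
rng = np.random.default_rng(0)
pred_uncond = rng.standard_normal((4, 8))  # prediction with the condition dropped
pred_cond = rng.standard_normal((4, 8))    # prediction with the condition supplied

for scale in (0.0, 1.0, 2.5):
    guided = cfg_combine(pred_uncond, pred_cond, scale)
    # Distance from the unconditional prediction grows linearly with the scale.
    print(scale, float(np.linalg.norm(guided - pred_uncond)))
```

The same combination applies whether the network predicts the noise or the clean sample; only the guidance scale needs tuning to trade condition fidelity against motion diversity.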