We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long video of a talking person, with its audio track, as the training data and arbitrary text as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not yet been technically achieved due to two challenges: (1) it is challenging for a traditional multi-speaker Text-to-Speech (TTS) system to mimic the timbre of out-of-domain audio; (2) it is hard to render high-fidelity, lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that effectively disentangles text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes both challenges and generates identity-preserving speech and realistic talking person videos. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos.