Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with audio, but they often fall short in controlling and conveying the avatar's detailed expressions and emotions, which makes the generated video less vivid and controllable. In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars that offers fine-grained control, improved interactivity, and generalizability in the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we design an automatic annotation pipeline to construct an instruction-video paired training dataset, and we pair it with a novel two-branch diffusion-based generator that predicts avatars conditioned on audio and text instructions simultaneously. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. Our project page is https://wangyuchi369.github.io/InstructAvatar/.