While considerable progress has been made in achieving accurate lip synchronization for 3D speech-driven talking face generation, synthesizing expressive facial details that match the speaker's speaking status remains challenging. Our goal is to directly leverage the style information inherent in human speech to generate an expressive talking face that aligns with the speaking status. In this paper, we propose AVI-Talking, an Audio-Visual Instruction system for expressive Talking face generation. The system harnesses the contextual reasoning and hallucination capabilities of Large Language Models (LLMs) to instruct the realistic synthesis of 3D talking faces. Instead of learning facial movements directly from human speech, our two-stage strategy first has the LLMs comprehend the audio information and generate instructions that imply expressive facial details corresponding to the speech. A diffusion-based generative network then executes these instructions. This two-stage process, coupled with the use of LLMs, enhances model interpretability and gives users the flexibility to comprehend the instructions and to specify desired operations or modifications. Extensive experiments demonstrate the effectiveness of our approach in producing vivid talking faces with expressive facial movements and consistent emotional status.