Recent neural talking radiance field methods have shown great success in photorealistic audio-driven talking face synthesis. In this paper, we propose a novel interactive framework that uses human instructions to edit such implicit neural representations, enabling real-time personalized talking face generation. Given a short speech video, we first build an efficient talking radiance field. We then apply a recent conditional diffusion model to edit images according to the given instructions and to guide the optimization of the implicit representation towards the editing target. To preserve audio-lip synchronization during editing, we propose an iterative dataset updating strategy and use a lip-edge loss to constrain changes in the lip region. We also introduce a lightweight refinement network that complements image details and enables controllable detail generation in the final rendered image. Our method supports real-time rendering at up to 30 FPS on consumer hardware. Multiple metrics and a user evaluation show that our approach significantly improves rendering quality over state-of-the-art methods.