Speech-driven 3D facial animation aims to generate realistic and expressive facial motions directly from audio. While recent methods achieve high-quality lip synchronization, they often rely on discrete emotion categories, limiting continuous and fine-grained emotional control. We present EditEmoTalk, a controllable speech-driven 3D facial animation framework with continuous emotion editing. The key idea is a boundary-aware semantic embedding that learns the normal directions of inter-emotion decision boundaries, enabling a continuous expression manifold for smooth emotion manipulation. Moreover, we introduce an emotional consistency loss that enforces semantic alignment between the generated motion dynamics and the target emotion embedding through a mapping network, ensuring faithful emotional expression. Extensive experiments demonstrate that EditEmoTalk achieves superior controllability, expressiveness, and generalization while maintaining accurate lip synchronization. Code and pretrained models will be released.
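To make the two core ideas concrete, the following is a minimal sketch (not the authors' released code) of how boundary-aware emotion editing and an emotional consistency loss could look in PyTorch. All module names, dimensions, and the cosine-similarity formulation are assumptions for illustration; the paper's actual architecture and loss may differ.

```python
# Minimal sketch, assuming a learned linear decision boundary between two emotions
# and a mapping network from motion dynamics into the emotion-embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 128  # assumed emotion-embedding dimensionality

# Linear classifier separating two emotion categories; its weight vector
# approximates the normal direction of the inter-emotion decision boundary.
boundary = nn.Linear(EMB_DIM, 1)

def edit_emotion(z: torch.Tensor, alpha: float) -> torch.Tensor:
    """Shift an emotion embedding z along the boundary normal by strength alpha."""
    n = F.normalize(boundary.weight.squeeze(0), dim=0)  # unit normal of the boundary
    return z + alpha * n                                 # continuous, smooth edit

# Mapping network projecting generated motion dynamics into the emotion space
# (256 is an assumed motion-feature dimensionality).
mapper = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, EMB_DIM))

def emotional_consistency_loss(motion_feat: torch.Tensor,
                               z_target: torch.Tensor) -> torch.Tensor:
    """Cosine-based alignment between mapped motion features and the target embedding."""
    pred = mapper(motion_feat)
    return 1.0 - F.cosine_similarity(pred, z_target, dim=-1).mean()

if __name__ == "__main__":
    z = torch.randn(4, EMB_DIM)            # batch of emotion embeddings
    z_edited = edit_emotion(z, alpha=0.5)  # move partway toward the target emotion
    motion = torch.randn(4, 256)           # placeholder motion-dynamics features
    loss = emotional_consistency_loss(motion, z_edited)
    print(z_edited.shape, loss.item())
```

In this reading, varying `alpha` continuously traverses the expression manifold between emotions, while the consistency loss penalizes generated motion whose mapped embedding drifts away from the target emotion.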