3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, fine-grained, interpretable, and editable control of facial expressions remains fundamentally challenging because speech-driven facial dynamics intrinsically conflict with explicit expression signals. Existing methods rely on implicit multimodal fusion, which leads to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB), which explicitly decouples modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE), which models temporally coherent facial action unit (AU) dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.
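As a rough illustration of the two ideas named in the abstract, the sketch below shows (a) attention whose logits receive an additive region-by-modality bias, and (b) temporal filtering applied to each AU channel independently. It is not the authors' implementation: the abstract gives no architectural details, so the function names `biased_attention` and `smooth_per_channel`, the two-region and two-modality layout, the bias values, and the use of a fixed 1D convolution in place of a learned encoder are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    q: (R, d) one query per facial region (hypothetical layout)
    k, v: (M, d) one key/value per modality token
    bias: (R, M) prior preference of each region for each modality
    """
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

def smooth_per_channel(au, kernels):
    """Filter each AU channel along time with its own kernel.

    au: (T, C) AU intensities over T frames
    kernels: (C, K) one temporal kernel per channel; channels never mix
    """
    cols = [np.convolve(au[:, c], kernels[c], mode="same")
            for c in range(au.shape[1])]
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
d = 8

# Rows: regions (mouth, upper face). Columns: modalities (audio, expression).
# Assumed prior: the mouth follows audio, the upper face follows expression.
bias = np.array([[4.0, -4.0],
                 [-4.0, 4.0]])
q = rng.normal(size=(2, d))
k = rng.normal(size=(2, d))
v = rng.normal(size=(2, d))
out, weights = biased_attention(q, k, v, bias)

# 16 frames of 3 hypothetical AU channels, each smoothed by a 3-tap average.
au = rng.random(size=(16, 3))
kernels = np.full((3, 3), 1.0 / 3.0)
smoothed = smooth_per_channel(au, kernels)
```

In this toy setup the bias only shifts the attention logits, so each region can still attend to the other modality when the content scores are strong enough. Because the temporal filter has no cross-channel terms, changing one AU channel leaves the others untouched, which is the property "channel-independent" suggests.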