Audio-driven facial animation is essential for immersive digital interaction, yet existing frameworks fail to reconcile real-time streaming with high-fidelity personalization. Current methods often rely on audio look-ahead, which adds latency, or require substantial user cooperation to pre-encode static embeddings that fail to capture dynamic idiosyncrasies. We present an end-to-end causal framework that personalizes facial motion generation via dynamic multi-modal style retrieval, enabling ultra-low latency while leveraging unstructured style references. We introduce two key innovations: (1) a temporal hierarchical motion representation that captures global temporal context and high-frequency details while maintaining decoding causality, and (2) a multi-modal style retriever that jointly queries audio and motion to dynamically extract stylistic priors without breaking causality. This mechanism allows scalable personalization with no constraint on the number or content of reference templates. By integrating these components into a causal autoregressive architecture, our method significantly outperforms state-of-the-art approaches in lip-sync accuracy, identity consistency, and perceived realism, as supported by extensive quantitative evaluations and user studies.
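To make the retrieval idea in innovation (2) more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the retriever can be approximated by softmax attention over a bank of style templates, with a query built only from the current audio feature and features of already generated motion. The function name `retrieve_style`, the feature shapes, and the `temperature` parameter are all hypothetical.

```python
import numpy as np

def retrieve_style(audio_feat, motion_feat, template_keys, template_values,
                   temperature=1.0):
    """Hypothetical causal style retrieval via softmax attention.

    audio_feat:      (Da,) audio feature for the current frame (no look-ahead).
    motion_feat:     (Dm,) feature summarizing already generated past motion.
    template_keys:   (N, Da + Dm) keys for N style reference templates.
    template_values: (N, Ds) style vectors for the same N templates.
    Returns the retrieved style vector (Ds,) and the attention weights (N,).
    """
    # Joint audio + motion query, built only from present and past information.
    query = np.concatenate([audio_feat, motion_feat])
    # Scaled dot-product similarity between the query and every template key.
    scores = template_keys @ query / (temperature * np.sqrt(query.size))
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    # Weighted mix of template styles; N may be any positive number.
    return weights @ template_values, weights

# Example with random placeholder features (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 7))    # 5 templates, Da + Dm = 7
values = rng.normal(size=(5, 3))  # Ds = 3
style, weights = retrieve_style(rng.normal(size=4), rng.normal(size=3),
                                keys, values)
```

Because the weights are normalized over whatever templates are supplied, the same function works for one reference clip or many, which mirrors the flexibility in template count described above. Calling it once per generated frame, with the query refreshed from the latest audio and motion, would keep the retrieval dynamic without using future frames.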