This study addresses the challenge of synchronizing facial dynamics with multilingual audio inputs, focusing on the creation of visually compelling, time-synchronized animations through diffusion-based techniques. Diverging from traditional parametric models for facial animation, our approach, termed LinguaLinker, adopts a holistic diffusion-based framework that integrates audio-driven visual synthesis to strengthen the coupling between auditory stimuli and visual responses. We process audio features in a separate branch and derive corresponding control gates, which implicitly govern the movements of the mouth, eyes, and head, irrespective of the portrait's origin. This audio-driven visual synthesis mechanism provides nuanced control while preserving the synchronization between the output video and the input audio, allowing for a more tailored and effective portrayal of distinct personas across different languages. The significant improvements our method achieves in the fidelity of animated portraits, the accuracy of lip-syncing, and the appropriateness of motion variations make it a versatile tool for animating any portrait in any language.
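To make the gating idea concrete, the following is a minimal PyTorch sketch of per-region control gates derived from audio features, under the assumption of a standard gated-conditioning design; the abstract does not specify module names, dimensions, or architecture, so everything here (RegionControlGates, audio_dim, cond_dim, the sigmoid gating heads) is hypothetical illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class RegionControlGates(nn.Module):
    """Hypothetical sketch: derive per-region control gates from audio
    features and scale a shared conditioning signal per facial region.
    All names and shapes are illustrative assumptions."""

    REGIONS = ("mouth", "eyes", "head")

    def __init__(self, audio_dim: int = 768, cond_dim: int = 320):
        super().__init__()
        # One small gating head per facial region; the sigmoid keeps each
        # gate in (0, 1) so it can softly enable or suppress that region's
        # audio-driven conditioning.
        self.gate_heads = nn.ModuleDict({
            r: nn.Sequential(nn.Linear(audio_dim, cond_dim), nn.Sigmoid())
            for r in self.REGIONS
        })
        self.cond_proj = nn.Linear(audio_dim, cond_dim)

    def forward(self, audio_feats: torch.Tensor) -> dict:
        # audio_feats: (batch, frames, audio_dim), e.g. from a pretrained
        # speech encoder processed separately from the visual branch.
        shared_cond = self.cond_proj(audio_feats)
        # Each region receives the shared conditioning scaled by its own
        # implicit gate, so mouth, eye, and head motion can respond
        # differently to the same audio, irrespective of the portrait.
        return {r: head(audio_feats) * shared_cond
                for r, head in self.gate_heads.items()}


if __name__ == "__main__":
    gates = RegionControlGates()
    audio = torch.randn(1, 25, 768)  # 25 audio frames (assumed rate)
    region_cond = gates(audio)
    print({k: v.shape for k, v in region_cond.items()})
```

In such a design, the gated per-region signals would condition the diffusion denoiser (for instance, through region-specific cross-attention), which is one plausible way the gates could govern motion implicitly rather than through explicit landmarks or parametric coefficients.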