SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the Wild

High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy: the mouth region of a talking-head video is masked, and the model learns to synthesize lip movements from the corrupted input and the target audio. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, because the input is corrupted, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address these mask-induced artifacts. Specifically, building on the Stage 1 model, we develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We further tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation in in-the-wild lip-syncing scenarios.
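
The Stage 2 data generation step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Clip`, `PseudoPair`, `build_pseudo_pairs`, and `fake_stage1` are hypothetical, strings stand in for real video and audio tensors, and the direction of the pair (Stage 1 output as input, original source video as target) is an assumption that the abstract does not spell out.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Clip:
    """A talking-head clip; strings stand in for real frame and waveform tensors."""
    video: str
    audio: str


@dataclass
class PseudoPair:
    """One Stage 2 training sample (the pairing direction is an assumption)."""
    input_video: str    # Stage 1 output: source video with lips driven by other audio
    driving_audio: str  # original audio of the source clip
    target_video: str   # unmasked source video used as supervision


def build_pseudo_pairs(
    clips: Sequence[Clip],
    stage1_lipsync: Callable[[str, str], str],
    rng: random.Random,
) -> List[PseudoPair]:
    """Create one pseudo-pair per clip using a (hypothetical) Stage 1 model."""
    if len(clips) < 2:
        raise ValueError("need at least two clips to sample audio from another clip")
    pairs = []
    for i, clip in enumerate(clips):
        # Randomly sample an audio track that belongs to a different clip.
        other_audio = rng.choice([c.audio for j, c in enumerate(clips) if j != i])
        # Hypothetical Stage 1 call: re-synthesize the lips of the source video.
        edited_video = stage1_lipsync(clip.video, other_audio)
        pairs.append(PseudoPair(edited_video, clip.audio, clip.video))
    return pairs


def fake_stage1(video: str, audio: str) -> str:
    """Placeholder for the Stage 1 model; it only records what was combined."""
    return f"{video}|lips<-{audio}"


clips = [Clip("vid_a", "aud_a"), Clip("vid_b", "aud_b"), Clip("vid_c", "aud_c")]
pairs = build_pseudo_pairs(clips, fake_stage1, random.Random(0))
```

Under this assumed pairing, a mask-free model tuned on `pairs` would see a complete, unmasked frame sequence as input and would only need to change the lips to match `driving_audio`, which is consistent with the abstract's stated goal of precise lip editing and better background consistency.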
