Generating high-quality, person-generic visual dubbing remains a challenge. Recent work has introduced a two-stage paradigm that decouples rendering from lip synchronization, with an intermediate representation serving as the bridge between the two stages. However, previous methods either rely on coarse landmarks or are confined to a single speaker, which limits their performance. In this paper, we propose DiffDub: Diffusion-based dubbing. We first build a diffusion auto-encoder with an inpainting renderer that uses a mask to separate editable zones from unaltered regions. This allows the lower-face region to be filled in seamlessly while the rest of the face is preserved. During our experiments, we encountered several challenges. First, the semantic encoder lacked robustness, which constrained its ability to capture high-level features. Second, the modeling ignored facial positioning, causing the mouth or nose to jitter across frames. To tackle these issues, we employ several strategies, including data augmentation and supplementary eye guidance. Moreover, we introduce a conformer-based reference encoder and motion generator strengthened by a cross-attention mechanism. This enables our model to learn person-specific textures from varying references and reduces reliance on paired audio-visual data. Our experiments show that our approach outperforms existing methods by considerable margins and delivers seamless, intelligible videos in person-generic and multilingual scenarios.
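The role of the mask in the inpainting renderer can be illustrated with a minimal sketch. It is not the DiffDub implementation: the function names `lower_face_mask` and `composite`, the half-height split, and the use of nested lists of grayscale values in place of image tensors are all assumptions made for illustration. The sketch shows only the compositing idea, in which generated pixels are kept where the mask marks the region as editable and original pixels are kept everywhere else.

```python
def lower_face_mask(height, width, top_fraction=0.5):
    """Build a binary mask: 1.0 marks editable rows, 0.0 marks preserved rows.

    Assumption: the editable zone is simply the bottom part of a face crop.
    The abstract does not specify the actual mask shape.
    """
    first_editable_row = int(height * top_fraction)
    return [
        [1.0 if row >= first_editable_row else 0.0 for _ in range(width)]
        for row in range(height)
    ]


def composite(original, generated, mask):
    """Blend two frames pixel by pixel: mask * generated + (1 - mask) * original."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


# Toy 4x2 "frames": the original is uniformly 0.2, the renderer output 0.9.
original = [[0.2, 0.2] for _ in range(4)]
generated = [[0.9, 0.9] for _ in range(4)]

mask = lower_face_mask(height=4, width=2)
result = composite(original, generated, mask)

for row in result:
    print(row)
```

In this toy example the top two rows keep the original values and the bottom two rows take the generated values. A real pipeline would apply the same blend to image tensors in an array library rather than to nested lists.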