GAN inversion is indispensable for applying the powerful editability of GANs to real images. However, existing methods invert video frames individually, which often leads to temporally inconsistent results. In this paper, we propose a unified recurrent framework, named \textbf{R}ecurrent v\textbf{I}deo \textbf{G}AN \textbf{I}nversion and e\textbf{D}iting (RIGID), to explicitly and simultaneously enforce temporally coherent GAN inversion and facial editing of real videos. Our approach models the temporal relations between the current and previous frames from three aspects. First, to enable faithful real video reconstruction, we maximize the inversion fidelity and consistency by learning a temporally compensated latent code. Second, we observe that incoherent noise lies in the high-frequency domain and can be disentangled from the latent space. Third, to remove the inconsistency after attribute manipulation, we propose an \textit{in-between frame composition constraint} such that any frame must be a direct composite of its neighboring frames. Our unified framework learns the inherent coherence between input frames in an end-to-end manner; it is therefore agnostic to the specific attribute and can be applied to arbitrary editing of the same video without re-training. Extensive experiments demonstrate that RIGID outperforms state-of-the-art methods qualitatively and quantitatively in both inversion and editing tasks. The deliverables can be found at \url{https://cnnlstm.github.io/RIGID}.