A jump cut introduces an abrupt, sometimes unwanted change in the viewing experience. We present a novel framework for smoothing jump cuts in talking head videos. We leverage the appearance of the subject from other source frames in the video, fusing it with a mid-level representation driven by DensePose keypoints and face landmarks. To achieve motion, we interpolate the keypoints and landmarks between the end frames around the cut. We then use an image translation network to synthesize pixels from the keypoints and source frames. Because keypoints can contain errors, we propose a cross-modal attention scheme that selects the most appropriate source among multiple options for each keypoint. By leveraging this mid-level representation, our method can achieve stronger results than a strong video interpolation baseline. We demonstrate our method on various jump cuts in talking head videos, such as cuts that remove filler words or pauses, and even random cuts. Our experiments show that we can achieve seamless transitions, even in challenging cases where the talking head rotates or moves drastically across the jump cut.
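The keypoint interpolation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which interpolation scheme is used, so plain linear blending is an assumption here, as is the representation of keypoints and landmarks as 2D `(x, y)` points. The function name `interpolate_keypoints` and its parameters are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def interpolate_keypoints(
    start: List[Point], end: List[Point], num_frames: int
) -> List[List[Point]]:
    """Linearly blend two keypoint sets into `num_frames` in-between sets.

    Assumption: linear interpolation over 2D points. The paper's actual
    scheme may differ.
    """
    if len(start) != len(end):
        raise ValueError("start and end must contain the same number of keypoints")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")

    frames: List[List[Point]] = []
    for i in range(1, num_frames + 1):
        # t lies strictly between 0 and 1, so the two end frames are excluded.
        t = i / (num_frames + 1)
        frames.append(
            [
                ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
                for (x0, y0), (x1, y1) in zip(start, end)
            ]
        )
    return frames


# Keypoints in the last frame before the cut and the first frame after it.
before_cut = [(0.0, 0.0), (10.0, 20.0)]
after_cut = [(4.0, 8.0), (14.0, 20.0)]
in_between = interpolate_keypoints(before_cut, after_cut, num_frames=3)
```

In the described pipeline, each interpolated keypoint set would then be passed, together with the source frames, to the image translation network that synthesizes the pixels; that stage is not sketched here.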