A jump cut introduces an abrupt, sometimes unwanted change in the viewing experience. We present a novel framework for smoothing jump cuts in talking head videos. We leverage the subject's appearance from other source frames in the video, fusing it with a mid-level representation driven by DensePose keypoints and face landmarks. To produce motion, we interpolate the keypoints and landmarks between the end frames on either side of the cut. An image translation network then synthesizes pixels from the interpolated keypoints and the source frames. Because keypoints can contain errors, we propose a cross-modal attention scheme that selects the most appropriate source among multiple candidates for each keypoint. By leveraging this mid-level representation, our method can achieve better results than a strong video interpolation baseline. We demonstrate our method on a variety of jump cuts in talking head videos, including cuts that remove filler words or pauses, and even random cuts. Our experiments show that we can achieve seamless transitions, even in challenging cases where the talking head rotates or moves drastically across the jump cut.
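The keypoint interpolation step described in the abstract can be illustrated with a minimal sketch. It assumes simple linear interpolation over 2D keypoint coordinates, which the abstract does not specify; the function name `interpolate_keypoints`, the `num_intermediate` parameter, and the sample coordinates are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def interpolate_keypoints(
    start: List[Point], end: List[Point], num_intermediate: int
) -> List[List[Point]]:
    """Linearly blend two keypoint sets into in-between frames.

    Assumption: linear blending is used here only for illustration;
    the actual interpolation scheme may differ.
    """
    if len(start) != len(end):
        raise ValueError("start and end must contain the same number of keypoints")
    frames = []
    for i in range(1, num_intermediate + 1):
        # t runs strictly between 0 and 1, so the end frames are excluded.
        t = i / (num_intermediate + 1)
        frames.append(
            [
                ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
                for (x0, y0), (x1, y1) in zip(start, end)
            ]
        )
    return frames


# Hypothetical keypoints from the frames just before and just after the cut.
before_cut = [(100.0, 120.0), (140.0, 118.0)]
after_cut = [(110.0, 130.0), (150.0, 122.0)]
transition = interpolate_keypoints(before_cut, after_cut, num_intermediate=3)
```

In the framework described above, intermediate keypoint sets like these would be passed, together with the source frames, to the image translation network that synthesizes the transition pixels.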