Alpha matting is widely used in video conferencing as well as in movies, television, and social media sites. Deep learning approaches to the matte extraction problem are well suited to video conferencing because the subject matter is consistent (front-facing humans). Training-based approaches are of limited use, however, for entertainment videos, where varied subjects (spaceships, monsters, etc.) may appear only a few times in a single movie: if a method of creating ground truth for training exists, that method can simply be used to produce the desired mattes directly. We introduce a training-free, high-quality neural matte extraction approach that specifically targets the assumptions of visual effects production. Our approach is based on the deep image prior, which optimizes a deep neural network to fit a single image, thereby providing a deep encoding of that particular image. We use the representations in the penultimate layer to interpolate coarse and incomplete "trimap" constraints. Videos processed with this approach are temporally consistent. The algorithm is both very simple and surprisingly effective.
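To make the idea concrete, the following is a minimal NumPy sketch of fitting a network to a single image and reading an alpha matte off its penultimate layer. It is an illustration under stated assumptions, not the method described above: a tiny coordinate-based MLP stands in for the convolutional generator normally used with the deep image prior, the colour and alpha heads are trained jointly with plain gradient descent, and the function name `fit_matte`, its parameters, and the trimap encoding (1 = foreground, 0 = background, 0.5 = unknown) are all hypothetical choices made for this example.

```python
import numpy as np

def fit_matte(image, trimap, hidden=16, steps=400, lr=0.05, seed=0):
    """Fit a tiny network to ONE image; predict alpha from its penultimate layer.

    image:  (H, W, 3) floats in [0, 1]
    trimap: (H, W) floats, 1 = foreground, 0 = background, 0.5 = unknown
    Returns (alpha with shape (H, W), list of per-step losses).
    """
    rng = np.random.default_rng(seed)
    H, W, _ = image.shape
    ys, xs = np.mgrid[0:H, 0:W]
    u = xs.ravel() / max(W - 1, 1)
    v = ys.ravel() / max(H - 1, 1)
    # Fixed per-pixel input code. Assumption: coordinates replace the noise
    # input and convolutional generator of a real deep image prior.
    X = np.stack([u, v, np.sin(np.pi * u), np.cos(np.pi * u),
                  np.sin(np.pi * v), np.cos(np.pi * v)], axis=1)
    Y = image.reshape(-1, 3)
    t = trimap.reshape(-1, 1)
    known = (t != 0.5).astype(float)      # mask of constrained pixels
    n_known = max(known.sum(), 1.0)

    def init(n_in, n_out):
        return rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out)

    W1, b1 = init(X.shape[1], hidden)
    W2, b2 = init(hidden, hidden)
    Wc, bc = init(hidden, 3)   # colour head: reconstructs the image
    Wa, ba = init(hidden, 1)   # alpha head: shares the penultimate features

    def forward():
        h1 = np.tanh(X @ W1 + b1)
        h2 = np.tanh(h1 @ W2 + b2)        # penultimate representation
        rgb = h2 @ Wc + bc
        a = 1.0 / (1.0 + np.exp(-(h2 @ Wa + ba)))
        return h1, h2, rgb, a

    losses = []
    for _ in range(steps):
        h1, h2, rgb, a = forward()
        # Loss: image reconstruction (all pixels) + cross-entropy on alpha
        # (only pixels where the trimap is known).
        ac = np.clip(a, 1e-7, 1 - 1e-7)
        bce = -(t * np.log(ac) + (1 - t) * np.log(1 - ac))
        losses.append(float(np.mean((rgb - Y) ** 2) + (bce * known).sum() / n_known))

        # Manual backpropagation through both heads into the shared layers.
        d_rgb = 2.0 * (rgb - Y) / Y.size
        d_logit = (a - t) * known / n_known
        d_h2 = d_rgb @ Wc.T + d_logit @ Wa.T
        d_z2 = d_h2 * (1 - h2 ** 2)
        d_z1 = (d_z2 @ W2.T) * (1 - h1 ** 2)
        grads = [(Wc, h2.T @ d_rgb), (bc, d_rgb.sum(0)),
                 (Wa, h2.T @ d_logit), (ba, d_logit.sum(0)),
                 (W2, h1.T @ d_z2), (b2, d_z2.sum(0)),
                 (W1, X.T @ d_z1), (b1, d_z1.sum(0))]
        for p, g in grads:
            p -= lr * g                   # in-place gradient descent step

    alpha = forward()[3].reshape(H, W)    # alpha for every pixel, incl. unknown
    return alpha, losses
```

Because the alpha head sees only features that were also shaped by the reconstruction loss, pixels in the unknown trimap band receive alpha values driven by how the network encodes the image there, which is the sense in which the sparse constraints are interpolated. A call such as `alpha, losses = fit_matte(image, trimap)` on a small float image is enough to experiment with; a practical system would use a convolutional network and a stronger optimizer.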