In this paper, we propose SVCNet, a scribble-based video colorization network with temporal aggregation. It colorizes monochrome videos according to user-given color scribbles. It addresses three common issues in scribble-based video colorization: colorization vividness, temporal consistency, and color bleeding. To improve colorization quality and strengthen temporal consistency, SVCNet uses two sequential sub-networks, one for precise colorization and one for temporal smoothing. The first stage includes a pyramid feature encoder that combines the color scribbles with a grayscale frame, and a semantic feature encoder that extracts semantics. The second stage refines the output of the first stage by aggregating information from neighboring colorized frames (as short-range connections) and from the first colorized frame (as a long-range connection). To alleviate color bleeding artifacts, we learn video colorization and segmentation simultaneously. Furthermore, we run the majority of operations at a fixed small image resolution and use a Super-resolution Module at the tail of SVCNet to recover the original size. This allows SVCNet to handle different image resolutions at inference. Finally, we evaluate the proposed SVCNet on the DAVIS and Videvo benchmarks. The experimental results demonstrate that SVCNet produces both higher-quality and more temporally consistent videos than other well-known video colorization approaches. The code and models can be found at https://github.com/zhaoyuzhi/SVCNet.
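The data flow described above (per-frame colorization at a fixed small resolution, refinement using a neighboring frame and the first frame, then recovery of the original size) can be illustrated with a minimal sketch. This is not the SVCNet implementation: `colorize_frame` and `refine_frame` are hypothetical placeholders for the two learned sub-networks, nearest-neighbor resizing stands in for the learned Super-resolution Module, and the working resolution, blend weights, and the use of only the previous frame as the neighbor are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical fixed working resolution; the value used by SVCNet is not stated here.
WORK_H, WORK_W = 16, 16


def resize_nearest(img, h, w):
    """Nearest-neighbor resize of an (H, W, C) array to (h, w, C).

    Stand-in for both the downscaling step and the learned Super-resolution Module.
    """
    rows = np.arange(h) * img.shape[0] // h
    cols = np.arange(w) * img.shape[1] // w
    return img[rows][:, cols]


def colorize_frame(gray_small, scribble_small):
    """Placeholder for stage one (a learned network in the paper).

    Here it simply returns the scribble chroma so the sketch stays runnable.
    """
    return scribble_small.copy()


def refine_frame(current, neighbors, first, w_short=0.25, w_long=0.25):
    """Placeholder for stage two: blend the stage-one output with
    neighboring frames (short-range) and the first frame (long-range).
    The fixed blend weights are an assumption, not learned parameters.
    """
    short = np.mean(neighbors, axis=0)
    return (1.0 - w_short - w_long) * current + w_short * short + w_long * first


def colorize_video(gray_frames, scribbles):
    """gray_frames: list of (H, W, 1) arrays; scribbles: list of (H, W, 2) arrays.

    Returns a list of (H, W, 3) arrays: original luminance plus predicted chroma.
    """
    first, previous, outputs = None, None, []
    for gray, scribble in zip(gray_frames, scribbles):
        h, w = gray.shape[:2]
        # Most of the work happens at the fixed small resolution.
        gray_s = resize_nearest(gray, WORK_H, WORK_W)
        scribble_s = resize_nearest(scribble, WORK_H, WORK_W)
        chroma = colorize_frame(gray_s, scribble_s)
        if first is None:
            refined = chroma          # nothing to aggregate for the first frame
            first = refined
        else:
            refined = refine_frame(chroma, [previous], first)
        previous = refined
        # Recover the original size only at the tail of the pipeline.
        chroma_full = resize_nearest(refined, h, w)
        outputs.append(np.concatenate([gray, chroma_full], axis=-1))
    return outputs
```

Because only the final step depends on the input size, the same sketch accepts frames of any resolution, which mirrors the reason given above for placing the Super-resolution Module at the tail of the network.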