Image stylization has seen significant advancement and widespread interest over the years, leading to the development of a multitude of techniques. Stylization techniques such as Neural Style Transfer (NST) are often extended to videos by applying them on a per-frame basis. However, per-frame stylization usually lacks temporal consistency, which manifests as undesirable flickering artifacts. Most existing approaches for enforcing temporal consistency suffer from one or more of the following drawbacks: they (1) are only suitable for a limited range of techniques, (2) do not support online processing because they require the complete video as input, (3) cannot provide consistency for the task of stylization, or (4) do not provide interactive consistency control. Domain-agnostic techniques for temporal consistency aim to eradicate flickering completely but typically disregard aesthetic aspects. For stylization tasks, however, consistency control is an essential requirement, as a certain amount of flickering adds to the artistic look and feel. Moreover, making this control interactive is paramount from a usability perspective. To meet these requirements, we propose an approach that stylizes video streams in real time at full HD resolution while providing interactive consistency control. We develop a lightweight optical-flow network that operates at 80 FPS on desktop systems with sufficient accuracy. Further, we employ an adaptive combination of local and global consistency features and enable interactive selection between them. Objective and subjective evaluations demonstrate that our method is superior to state-of-the-art video consistency approaches.
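To illustrate what interactive consistency control can mean in practice, the following is a minimal sketch of flow-guided temporal blending with a single user-set strength. It is not the proposed method: the function names `warp_backward` and `blend_consistent`, the `strength` parameter, and the nearest-neighbour warping are all assumptions made for illustration, and the flow field is taken as given rather than predicted by a network.

```python
import numpy as np


def warp_backward(prev, flow):
    """Warp `prev` (H, W, C) into the current frame.

    `flow` (H, W, 2) holds per-pixel (dx, dy) offsets that point from each
    current-frame pixel to its source location in `prev` (assumed convention).
    Nearest-neighbour sampling keeps the sketch short.
    """
    h, w = prev.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    # Pixels whose source falls outside the frame have no valid history.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    src_x = np.clip(src_x, 0, w - 1)
    src_y = np.clip(src_y, 0, h - 1)
    return prev[src_y, src_x], valid


def blend_consistent(stylized, prev_out, flow, strength):
    """Blend the per-frame stylized result with the warped previous output.

    strength = 0.0 keeps the per-frame result (all flicker retained);
    strength = 1.0 reuses the warped previous output wherever it is valid.
    """
    warped, valid = warp_backward(prev_out, flow)
    weight = strength * valid[..., None].astype(stylized.dtype)
    return (1.0 - weight) * stylized + weight * warped
```

In an online setting, `blend_consistent` would be called once per incoming frame with the previous output, and `strength` could be bound to a slider so that the amount of retained flicker can be changed while the stream is running.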