Video frame interpolation (VFI), which aims to synthesize intermediate frames of a video, has made remarkable progress with the development of deep convolutional networks in recent years. Existing methods built upon convolutional networks generally face the challenge of handling large motion due to the locality of convolution operations. To overcome this limitation, we introduce a novel framework that takes advantage of Transformers to model long-range pixel correlation among video frames. Furthermore, our network is equipped with a novel cross-scale window-based attention mechanism, in which windows at different scales interact with each other. This design effectively enlarges the receptive field and aggregates multi-scale information. Extensive quantitative and qualitative experiments demonstrate that our method achieves new state-of-the-art results on various benchmarks.
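The core idea of cross-scale window attention can be illustrated with a minimal sketch. The code below is a simplified, hypothetical NumPy implementation (not the paper's actual architecture): each fine-scale window issues queries, while its keys and values are drawn from both the window itself and a 2x-downsampled (average-pooled) version of a larger surrounding region, so coarse-scale context enlarges the effective receptive field. Function names, the window size `win`, and the pooling scheme are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_window_attention(feat, win=4):
    """Toy single-head cross-scale window attention (illustrative only).

    feat: (H, W, C) feature map with H and W divisible by `win`.
    Queries come from each fine win x win window; keys/values are the
    fine tokens plus 2x2-average-pooled coarse tokens from a 2x larger
    neighborhood around the window, mixing information across scales.
    """
    H, W, C = feat.shape
    scale = 1.0 / np.sqrt(C)
    out = np.empty_like(feat)
    for i in range(0, H, win):
        for j in range(0, W, win):
            # Fine-scale tokens inside the window: win*win queries.
            fine = feat[i:i + win, j:j + win].reshape(-1, C)
            # Coarse tokens: average-pool a 2x larger region around the window.
            i0, j0 = max(0, i - win // 2), max(0, j - win // 2)
            region = feat[i0:i0 + 2 * win, j0:j0 + 2 * win]
            h2 = region.shape[0] // 2 * 2
            w2 = region.shape[1] // 2 * 2
            coarse = (region[:h2, :w2]
                      .reshape(h2 // 2, 2, w2 // 2, 2, C)
                      .mean(axis=(1, 3))
                      .reshape(-1, C))
            # Keys/values span both scales; queries are fine-scale only.
            kv = np.concatenate([fine, coarse], axis=0)
            attn = softmax(fine @ kv.T * scale, axis=-1)
            out[i:i + win, j:j + win] = (attn @ kv).reshape(win, win, C)
    return out
```

In a real Transformer block this attention would use learned query/key/value projections, multiple heads, and relative position bias; the sketch keeps only the cross-scale token mixing that the abstract describes.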