Video frame interpolation has been actively studied with the development of convolutional neural networks. However, because convolution shares kernel weights across all spatial positions, the interpolated frames it generates may lose details. In contrast, the attention mechanism in the Transformer can better distinguish the contribution of each pixel and can also capture long-range pixel dependencies, which offers great potential for video interpolation. Nevertheless, the original Transformer is commonly used for 2D images; how to develop a Transformer-based framework that accounts for temporal self-attention in video frame interpolation remains an open issue. In this paper, we propose the Video Frame Interpolation Flow Transformer, which incorporates motion dynamics from optical flow into the self-attention mechanism. Specifically, we design a Flow Transformer Block that calculates temporal self-attention in a matched local area under the guidance of flow, making our framework suitable for interpolating frames with large motion while keeping complexity reasonably low. In addition, we construct a multi-scale architecture to account for multi-scale motion, further improving overall performance. Extensive experiments on three benchmarks demonstrate that the proposed method generates interpolated frames with better visual quality than state-of-the-art methods.
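To make the idea of flow-guided local attention concrete, the following is a minimal NumPy sketch, not the Flow Transformer Block itself. It assumes single-head attention without learned projections, nearest-pixel rounding of the flow, and border clamping; the function name `flow_guided_local_attention` and the `radius` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def flow_guided_local_attention(query_feat, key_feat, value_feat, flow, radius=1):
    """Attend from each query pixel to a small window in another frame.

    The window is centred at the query position shifted by the optical flow,
    so the attended area follows the motion (illustrative assumption, not the
    paper's exact formulation).

    query_feat, key_feat: arrays of shape (H, W, C)
    value_feat:           array of shape (H, W, Cv)
    flow:                 array of shape (H, W, 2) with (dx, dy) in pixels
    """
    H, W, C = query_feat.shape
    out = np.zeros((H, W, value_feat.shape[-1]), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            # Flow-shifted window centre, rounded and clamped to the image.
            cx = int(np.clip(np.rint(x + flow[y, x, 0]), 0, W - 1))
            cy = int(np.clip(np.rint(y + flow[y, x, 1]), 0, H - 1))
            # Window bounds, cropped at the image border.
            y0, y1 = max(cy - radius, 0), min(cy + radius, H - 1) + 1
            x0, x1 = max(cx - radius, 0), min(cx + radius, W - 1) + 1
            keys = key_feat[y0:y1, x0:x1].reshape(-1, C)
            vals = value_feat[y0:y1, x0:x1].reshape(-1, value_feat.shape[-1])
            # Scaled dot-product attention restricted to the matched window.
            scores = keys @ query_feat[y, x] / np.sqrt(C)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            out[y, x] = weights @ vals
    return out

# Example: a uniform flow of 2 pixels to the right.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 6, 3))
k = rng.standard_normal((4, 6, 3))
v = rng.standard_normal((4, 6, 3))
flow = np.zeros((4, 6, 2))
flow[..., 0] = 2.0
out_point = flow_guided_local_attention(q, k, v, flow, radius=0)
out_window = flow_guided_local_attention(q, k, v, flow, radius=1)
```

In this sketch each pixel attends to at most (2·radius + 1)² positions, so the cost grows linearly with the number of pixels rather than quadratically as in global self-attention, while the flow offset lets a small window still cover large displacements.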