Video frame interpolation is an increasingly important research task with several key industrial applications in the video coding, broadcast and production sectors. Recently, transformers have been introduced to the field, resulting in substantial performance gains. However, these gains come at the cost of greatly increased memory usage and longer training and inference times. In this paper, a novel method integrating a transformer encoder and convolutional features is proposed. Compared to existing transformer-based interpolation methods, the proposed network reduces the memory burden by close to 50% and runs up to four times faster at inference time. A dual-encoder architecture is introduced which combines the strength of convolutions in modelling local correlations with that of the transformer in modelling long-range dependencies. Quantitative evaluations are conducted on various benchmarks with complex motion to showcase the robustness of the proposed method, which achieves competitive performance compared to state-of-the-art interpolation networks.