Video frame interpolation (VFI) is a fundamental vision task that aims to synthesize several intermediate frames between two consecutive frames of a video. Most algorithms attempt VFI using only keyframes, which is an ill-posed problem because the keyframes usually do not provide precise information about the trajectories of the objects in the scene. Event-based cameras, in contrast, provide more precise information about what happens between the keyframes of a video. Some recent state-of-the-art event-based methods approach this problem by using event data to improve optical flow estimation and then interpolating frames by warping. Nonetheless, those methods suffer heavily from ghosting. Meanwhile, some kernel-based VFI methods that use only frames as input have shown that deformable convolutions, when backed by transformers, can be a reliable way of handling long-range dependencies. We propose event-based video frame interpolation with attention (E-VFIA), a lightweight kernel-based method. E-VFIA fuses event information with standard video frames through deformable convolutions to generate high-quality interpolated frames. The proposed method represents events with high temporal resolution and uses a multi-head self-attention mechanism to better encode event-based information, while being less vulnerable to blurring and ghosting artifacts, and thus generates crisper frames. The simulation results show that the proposed technique outperforms current state-of-the-art methods (both frame-based and event-based) with a significantly smaller model size.
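To make the idea of representing events with high temporal resolution more concrete, the following is a minimal sketch of one common way to do it: accumulating signed event polarities into a stack of temporal bins (a voxel grid) spanning the interval between two keyframes. This is an illustrative assumption, not necessarily the representation used in E-VFIA; the function name `events_to_voxel_grid`, the `(t, x, y, polarity)` event layout, and all parameters are hypothetical.

```python
def events_to_voxel_grid(events, num_bins, height, width, t_start, t_end):
    """Accumulate signed event polarities into num_bins temporal slices.

    events: iterable of (t, x, y, polarity), with polarity in {+1, -1}.
    Returns a nested list indexed as grid[bin][y][x].
    Hypothetical helper for illustration only.
    """
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")

    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    bin_width = (t_end - t_start) / num_bins

    for t, x, y, polarity in events:
        # Skip events outside the keyframe interval or outside the sensor.
        if not (t_start <= t <= t_end):
            continue
        if not (0 <= x < width and 0 <= y < height):
            continue
        # Hard assignment to a temporal bin; an event at t_end goes to the last bin.
        b = min(int((t - t_start) / bin_width), num_bins - 1)
        grid[b][y][x] += polarity
    return grid
```

A larger `num_bins` preserves more of the events' timing at the cost of a larger input tensor; a learned encoder (for example, one using self-attention) could then operate on the resulting bins.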