Motion estimation between neighboring frames is central to existing Video Frame Interpolation (VFI) approaches. Its accuracy remains limited, however, primarily because of the inherent ambiguity in identifying corresponding areas in adjacent frames. Distinguishing different regions before motion estimation is therefore important for improving accuracy. In this paper, we propose using open-world segmentation models, e.g., SAM (Segment Anything Model), to derive Region-Distinguishable Priors (RDPs) for different frames. These RDPs are represented as spatially varying Gaussian mixtures, which distinguish an arbitrary number of areas within a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance the features used for motion estimation, by means of our plug-and-play Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDPs into multiple hierarchical stages of the VFI encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDPs, features in the VFI encoder exhibit similar representations for matched regions in neighboring frames, which improves the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.
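To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a segmentation model has already produced an integer label map per frame, that each region is summarized by one Gaussian over its pixel coordinates, and that the normalization modulates instance-normalized features with a scale and shift derived from the prior. The function names `region_prior` and `rdp_guided_norm`, and the per-channel weights `w_scale` and `w_shift`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def region_prior(labels: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map an integer label map (H, W) to a single-channel prior in (0, 1].

    Assumption: each region gets one Gaussian centred on its centroid, with
    the covariance of its pixel coordinates; every pixel is scored by the
    Gaussian of the region it belongs to.
    """
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys, xs], axis=-1).astype(np.float64)  # (H, W, 2)
    prior = np.zeros((h, w), dtype=np.float64)
    for region_id in np.unique(labels):
        mask = labels == region_id
        pts = coords[mask]                                   # (N, 2)
        diff = pts - pts.mean(axis=0)
        # Small ridge term keeps the covariance invertible for tiny regions.
        cov = diff.T @ diff / len(pts) + eps * np.eye(2)
        maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
        prior[mask] = np.exp(-0.5 * maha)
    return prior


def rdp_guided_norm(feat, prior, w_scale, w_shift, eps=1e-5):
    """Residual, prior-guided normalization of a feature map.

    feat: (C, H, W) features; prior: (H, W) map from region_prior;
    w_scale, w_shift: (C,) hypothetical learned weights.
    """
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True)
    normed = (feat - mu) / (sigma + eps)
    gamma = w_scale[:, None, None] * prior[None]   # per-pixel scale
    beta = w_shift[:, None, None] * prior[None]    # per-pixel shift
    # Residual form: with zero weights the input passes through unchanged.
    return feat + normed * gamma + beta


# Toy example: two regions splitting an 8x8 frame into left and right halves.
labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1
prior = region_prior(labels)
feat = np.random.default_rng(0).normal(size=(4, 8, 8))
out = rdp_guided_norm(feat, prior, np.full(4, 0.1), np.full(4, 0.1))
```

In a hierarchical setting, the prior would be resized to each encoder stage's resolution and the same residual step applied per stage. The residual form means the module can be added to a pretrained encoder without changing its output when the weights start at zero.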