Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named \textbf{L}ocal \textbf{D}iffusion \textbf{F}orcing for \textbf{V}ideo \textbf{F}rame \textbf{I}nterpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate the error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also generalizes to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion. The source code is available at \url{https://github.com/xypeng9903/LDF-VFI}.
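The tiled VAE encoding mentioned above can be illustrated with a minimal sketch. Everything here is hypothetical, not taken from LDF-VFI: the average-pooling "encoder" is a toy stand-in for a learned VAE encoder (it only mimics the typical 8x spatial downsampling), and the function names, tile size, and overlap are illustrative choices. The key idea is that each tile is encoded independently and overlapping latents are blended, so peak memory depends on the tile size rather than the frame resolution.

```python
import numpy as np


def encode_tile(tile: np.ndarray, factor: int = 8) -> np.ndarray:
    # Toy stand-in for a VAE encoder: average-pool by `factor` per spatial
    # axis. A real VAE encoder is a learned network; this only mimics the
    # 8x downsampling such encoders typically apply.
    h, w, c = tile.shape
    return tile.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))


def _starts(size: int, tile: int, step: int) -> list:
    # Tile start offsets covering [0, size), with the last tile flush
    # against the edge (assumes size >= tile).
    starts = list(range(0, size - tile + 1, step))
    if starts[-1] != size - tile:
        starts.append(size - tile)
    return starts


def tiled_encode(frame: np.ndarray, tile: int = 256, overlap: int = 32,
                 factor: int = 8) -> np.ndarray:
    # Encode a large frame tile-by-tile with overlapping tiles, averaging
    # the latents in the overlap regions. Memory use is bounded by the tile
    # size, so arbitrarily high resolutions can be processed.
    h, w, c = frame.shape
    step = tile - overlap
    latent = np.zeros((h // factor, w // factor, c))
    weight = np.zeros((h // factor, w // factor, 1))
    lt = tile // factor
    for y in _starts(h, tile, step):
        for x in _starts(w, tile, step):
            z = encode_tile(frame[y:y + tile, x:x + tile], factor)
            ly, lx = y // factor, x // factor
            latent[ly:ly + lt, lx:lx + lt] += z
            weight[ly:ly + lt, lx:lx + lt] += 1.0
    return latent / weight
```

Because this toy encoder is purely local and the tile offsets are multiples of the downsampling factor, the blended tiled result matches a full-frame encode exactly; with a real convolutional encoder the tiles are only approximately consistent, which is why the overlap-and-blend step matters.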