Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate one. However, without additional guidance, the large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance, which allows EVFI methods to significantly outperform frame-only methods. To date, however, EVFI methods have relied on a limited set of paired event-frame training data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited-data challenge by adapting pre-trained video diffusion models, trained on internet-scale datasets, to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing approaches and generalizes across cameras far better than they do.