High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

Despite recent progress, existing frame interpolation methods still struggle with extremely high-resolution inputs and with challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce HiFI, a patch-based cascaded pixel diffusion model for frame interpolation that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, help significantly with large or complex motion, which requires both global context for a coarse solution and detailed context for a high-resolution output. However, unlike prior cascaded diffusion models, which perform diffusion at increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. We show that this technique drastically reduces memory usage at inference time and lets a single model handle both frame interpolation and spatial upsampling at test time, which also saves training cost. In particular, HiFI helps significantly with high-resolution inputs and complex repeated textures that require global context. HiFI matches or exceeds state-of-the-art performance on multiple benchmarks (Vimeo, Xiph, X-Test, SEPE-8K). On our newly introduced dataset, which focuses on particularly challenging cases, HiFI also significantly outperforms other baselines. Please visit our project page for video results: https://hifi-diffusion.github.io
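
The patch-based cascade described above can be illustrated with a minimal, hypothetical sketch in Python. It is not HiFI's implementation. The abstract does not specify the patch size, the upsampling factor between levels, or how the model is conditioned, so this sketch assumes the following: square grayscale frames whose side is `patch * 2**k`, a 2x upsampling factor per level, nearest-neighbour upsampling of the prior solution, and non-overlapping patches. All names (`cascaded_interpolate`, `toy_model`, and the helpers) are illustrative. `toy_model` is a placeholder that averages the two input frames instead of running diffusion sampling. What the sketch shows is structural. The coarsest level fits the whole frame into one patch, which gives the model global context, and every later level reuses the same fixed-size model on patches, so the per-call input size stays constant regardless of output resolution.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values


def resize_box(img: Image, h: int, w: int) -> Image:
    """Downsample to (h, w) with box (mean) filtering; assumes sizes divide evenly."""
    fy, fx = len(img) // h, len(img[0]) // w
    return [
        [
            sum(img[y * fy + dy][x * fx + dx] for dy in range(fy) for dx in range(fx)) / (fy * fx)
            for x in range(w)
        ]
        for y in range(h)
    ]


def upsample_nearest(img: Image, factor: int) -> Image:
    """Nearest-neighbour upsampling by an integer factor."""
    return [
        [img[y // factor][x // factor] for x in range(len(img[0]) * factor)]
        for y in range(len(img) * factor)
    ]


def crop(img: Image, y: int, x: int, p: int) -> Image:
    """Extract the p x p patch whose top-left corner is (y, x)."""
    return [row[x:x + p] for row in img[y:y + p]]


def toy_model(f0: Image, f1: Image, prior: Image) -> Image:
    """HYPOTHETICAL stand-in for one diffusion sampling run at a fixed patch size.
    A real model would denoise conditioned on all three patches; this toy
    averages the two input frames and ignores the prior."""
    return [[(a + b) / 2 for a, b in zip(r0, r1)] for r0, r1 in zip(f0, f1)]


def cascaded_interpolate(
    f0: Image, f1: Image, patch: int, model: Callable = toy_model
) -> Tuple[Image, int]:
    """Coarse-to-fine interpolation in which the model only ever sees patch x patch inputs.
    Returns the full-resolution result and the largest patch area passed to the model."""
    size = len(f0)
    level = patch  # coarsest level: the whole frame fits in a single patch
    prior = None
    max_area = 0
    while True:
        g0, g1 = resize_box(f0, level, level), resize_box(f1, level, level)
        # Upsample the previous (coarser) solution to this level, or start from zeros.
        if prior is None:
            up = [[0.0] * level for _ in range(level)]
        else:
            up = upsample_nearest(prior, 2)
        out = [[0.0] * level for _ in range(level)]
        # Run the same fixed-size model on each patch and stitch the results.
        for y in range(0, level, patch):
            for x in range(0, level, patch):
                res = model(crop(g0, y, x, patch), crop(g1, y, x, patch), crop(up, y, x, patch))
                max_area = max(max_area, len(res) * len(res[0]))
                for dy, row in enumerate(res):
                    out[y + dy][x:x + patch] = row
        prior = out
        if level == size:
            return prior, max_area
        level *= 2
```

For example, with 8x8 frames and `patch=2`, the model is called once at the 2x2 level, 4 times at 4x4, and 16 times at 8x8, yet `max_area` stays 4. A practical system would likely overlap patches and blend the seams to avoid tiling artifacts. That step is omitted here to keep the sketch short.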
