Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) inputs, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts and must instead accept whatever the model produces. In this paper, we propose SparkVSR, a novel interactive VSR framework that turns sparse keyframes into a simple yet expressive control signal. Specifically, users can first super-resolve an optional, small set of keyframes using any off-the-shelf image super-resolution (ISR) model; SparkVSR then propagates the keyframe priors to the entire video sequence while remaining grounded in the motion of the original LR video. Concretely, we introduce a keyframe-conditioned, two-stage latent-pixel training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence against blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, thereby enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework: it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/