Text-conditioned diffusion models have advanced image and video super-resolution by using prompts as semantic priors, but modern super-resolution pipelines typically rely on latent tiling to scale to high resolutions, where a single global caption causes prompt underspecification. A coarse global prompt often misses localized details (prompt sparsity) and provides locally irrelevant guidance (prompt misguidance) that can be amplified by classifier-free guidance. We propose Tiled Prompts, a unified framework for image and video super-resolution that generates a tile-specific prompt for each latent tile and performs super-resolution under locally text-conditioned posteriors, providing high-information guidance that resolves prompt underspecification with minimal overhead. Experiments on high-resolution real-world images and videos show consistent gains in perceptual quality and text alignment, while reducing hallucinations and tile-level artifacts relative to global-prompt baselines.
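The per-tile guidance idea above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: `embed_prompt`, `denoise_step`, and the scalar "latents" are hypothetical stand-ins for a real text encoder, diffusion backbone, and latent tiles, but the classifier-free guidance (CFG) combination and the tile-to-prompt pairing follow the standard formulation.

```python
# Hypothetical sketch of tile-wise prompting with classifier-free guidance.
# All names and models here are toy stand-ins, not the paper's actual API.

def embed_prompt(prompt: str) -> float:
    # Toy "text embedding": sum of character codes (stand-in for a real encoder).
    return float(sum(ord(c) for c in prompt))

def predict_noise(latent: float, cond: float) -> float:
    # Toy noise prediction (a real model would be a diffusion U-Net/DiT).
    return 0.1 * latent + 0.001 * cond

def cfg_denoise(latent: float, prompt: str, guidance_scale: float = 5.0) -> float:
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the text-conditional one. With a locally irrelevant prompt, this
    # extrapolation amplifies misguidance; with a tile-specific prompt it
    # amplifies locally relevant detail instead.
    eps_uncond = predict_noise(latent, embed_prompt(""))
    eps_cond = predict_noise(latent, embed_prompt(prompt))
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def tiled_sr_step(latent_tiles, tile_prompts, guidance_scale: float = 5.0):
    # Each latent tile is denoised under its own tile-specific prompt,
    # rather than one coarse global caption shared by every tile.
    return [cfg_denoise(z, p, guidance_scale)
            for z, p in zip(latent_tiles, tile_prompts)]

tiles = [1.0, 2.0, 3.0]
prompts = ["brick wall, moss", "window, reflections", "wooden door, grain"]
out = tiled_sr_step(tiles, prompts)
```

Pairing each tile with its own prompt means the CFG extrapolation direction differs per tile, which is the mechanism by which tile-specific prompts avoid the sparsity and misguidance of a single global caption.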