Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions due to the scarcity of high-resolution data and constrained computational resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods remain prone to producing low-quality visual content with repetitive patterns. The key obstacle is the inevitable increase in high-frequency information when the model generates content beyond its training resolution, which leads to undesirable repetitive patterns arising from accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm that enables higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting the desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the higher-resolution generation capabilities of both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks the generation of 8K-resolution images for the first time.
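The scale-fusion idea can be illustrated with a minimal frequency-domain sketch: take low-frequency structure from a feature map computed at the global receptive scale and high-frequency detail from one computed at the local scale. The function names, the FFT-based low-pass filter, and the specific fusion rule below are illustrative assumptions, not FreeScale's actual implementation.

```python
import numpy as np

def lowpass(x, cutoff):
    """Keep only spatial frequencies below `cutoff` (a fraction of Nyquist)."""
    f = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    yy, xx = np.mgrid[:h, :w]
    cy, cx = h // 2, w // 2
    # circular low-pass mask in the centered frequency plane
    mask = ((yy - cy) / (h / 2)) ** 2 + ((xx - cx) / (w / 2)) ** 2 <= cutoff ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

def fuse_scales(global_feat, local_feat, cutoff=0.25):
    # Low frequencies (layout, structure) from the global receptive scale;
    # high frequencies (fine detail) from the local receptive scale.
    return lowpass(global_feat, cutoff) + (local_feat - lowpass(local_feat, cutoff))
```

When both inputs are identical, the fusion is an identity up to floating-point error, so the operator only changes the output where the two receptive scales actually disagree in some frequency band.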