Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, representing a 10$\times$ to 35$\times$ speedup over state-of-the-art methods, while maintaining superior visual fidelity. Extensive experiments validate both the efficiency gains and the output quality achieved by our approach.