Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis that mirrors human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), a training-free, inference-time guidance method that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual isolated from a coarser prior. To obtain this prior, we introduce a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models that use discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate that SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
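The core idea can be illustrated with a minimal sketch: build a coarse prior, take the high-frequency residual not explained by it, and amplify that residual with a guidance weight. Everything below is an illustrative assumption, not the paper's implementation: `lowpass_prior` is a toy FFT low-pass stand-in for the DSE prior construction, `scaled_spatial_guidance` operates on a dense array rather than the discrete token logits the actual method guides, and the `cutoff` and `weight` values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def lowpass_prior(img, cutoff=0.15):
    """Coarse prior via a frequency-domain low-pass filter.

    Illustrative stand-in for the paper's DSE prior; keeps only
    spatial frequencies within a normalized radius `cutoff`."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.sqrt(((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2)
    return np.fft.ifft2(np.fft.ifftshift(F * (radius <= cutoff))).real

def scaled_spatial_guidance(pred, weight=1.5, cutoff=0.15):
    """Steer a prediction toward its high-frequency semantic residual:
    guided = prior + weight * (pred - prior).  A weight > 1 emphasizes
    content not already explained by the coarser prior."""
    prior = lowpass_prior(pred, cutoff)
    return prior + weight * (pred - prior)

img = rng.standard_normal((64, 64))
guided = scaled_spatial_guidance(img, weight=2.0)
prior = lowpass_prior(img)
```

With `weight=1.0` the prediction is unchanged; larger weights linearly amplify the residual relative to the prior, which is the sense in which the guidance "emphasizes" high-frequency content at each scale.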