While inference-time scaling has significantly enhanced generative quality in large language and diffusion models, its application to vector-quantized (VQ) visual autoregressive modeling (VAR) remains unexplored. We introduce VAR-Scaling, the first general framework for inference-time scaling in VAR, addressing the critical challenge that discrete latent spaces prohibit continuous path search. We find that VAR's scales exhibit two distinct pattern types, general patterns and specific patterns, where later-stage specific patterns conditionally optimize early-stage general patterns. To overcome the discrete latent space barrier in VQ models, we map sampling spaces to quasi-continuous feature spaces via kernel density estimation (KDE), where high-density samples approximate stable, high-quality solutions. This transformation enables effective navigation of sampling distributions. We propose a density-adaptive hybrid sampling strategy: Top-k sampling focuses on high-density regions to preserve quality near distribution modes, while Random-k sampling explores low-density areas to maintain diversity and prevent premature convergence. Consequently, VAR-Scaling optimizes sample fidelity at critical scales to enhance output quality. Experiments on class-conditional and text-to-image generation demonstrate significant quality improvements during inference. The code is available at https://github.com/WD7ang/VAR-Scaling.
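To make the density-adaptive hybrid sampling idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it estimates a KDE over candidate codebook features, then applies Top-k sampling when the most probable candidate lies in a high-density region and Random-k sampling otherwise. The function name `hybrid_sample`, the `density_threshold` parameter, and the simple normalized-density test are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def hybrid_sample(features, logits, density_threshold=0.5, k=5, rng=None):
    """Illustrative density-adaptive hybrid sampling (hypothetical sketch).

    features: (V, D) feature vectors of the V candidate codebook tokens
    logits:   (V,) unnormalized scores over the candidates
    """
    rng = np.random.default_rng() if rng is None else rng

    # Map the discrete candidates into a quasi-continuous density estimate.
    # gaussian_kde expects shape (D, V): one column per sample.
    kde = gaussian_kde(features.T)
    density = kde(features.T)            # density evaluated at each candidate
    density = density / density.max()    # normalize to [0, 1]

    # Softmax over logits (numerically stabilized).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if density[probs.argmax()] >= density_threshold:
        # High-density region: Top-k near the mode preserves quality.
        top = np.argsort(probs)[-k:]
        p = probs[top] / probs[top].sum()
        return int(rng.choice(top, p=p))
    else:
        # Low-density region: Random-k keeps diversity and avoids
        # premature convergence to a sparse mode.
        rand = rng.choice(len(probs), size=k, replace=False)
        p = probs[rand] / probs[rand].sum()
        return int(rng.choice(rand, p=p))
```

In this sketch the KDE is fit over the candidate set itself for simplicity; in practice the density model and the high/low-density decision would follow the paper's actual feature-space construction.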