SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

Visual AutoRegressive (VAR) modeling has garnered significant attention for its next-scale prediction paradigm. However, mainstream VAR models attend to all tokens across all historical scales at each autoregressive step. As the resolution of the next scale grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior acceleration methods often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling efficient sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves a $\mathbf{> 5\times}$ faster forward pass than FlashAttention. Extensive experiments demonstrate that SparVAR can reduce the generation time of an 8B model producing $1024\times1024$ high-resolution images to the 1-second level, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a $\mathbf{1.57\times}$ speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains up to a $\mathbf{2.28\times}$ acceleration while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.
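
The quartic-growth claim can be checked with simple counting. The sketch below is illustrative only and is not the SparVAR implementation: it assumes each scale is a square token map of side `s`, that every scale attends to all tokens of the scales up to and including itself, and that a local variant keeps at most a `window × window` neighborhood of keys per scale. The scale schedule, the function names `attention_cost` and `local_pairs`, and the `window` parameter are all hypothetical choices made for this example.

```python
def attention_cost(side_lengths):
    """Dense next-scale attention: return (queries, keys, query-key pairs) per scale.

    Assumption: scale k has s*s tokens and attends to every token of
    scales 1..k (itself included).
    """
    costs = []
    history = 0
    for s in side_lengths:
        queries = s * s
        history += queries  # keys = all tokens generated so far, plus this scale
        costs.append((queries, history, queries * history))
    return costs


def local_pairs(side_lengths, window):
    """Query-key pairs per scale when each query keeps at most
    window*window keys from every scale (a hypothetical locality rule)."""
    pairs = []
    for k, s in enumerate(side_lengths):
        keys_per_query = sum(
            min(t * t, window * window) for t in side_lengths[: k + 1]
        )
        pairs.append(s * s * keys_per_query)
    return pairs


# Hypothetical scale schedule, chosen only for illustration.
scales = [1, 2, 4, 8, 16, 32, 64]
dense = attention_cost(scales)
local = local_pairs(scales, window=8)

for s, (q, k, p), lp in zip(scales, dense, local):
    print(f"side={s:3d} queries={q:5d} keys={k:5d} dense_pairs={p:9d} local_pairs={lp:8d}")

# Doubling the side length multiplies the dense pair count by roughly 2**4.
print("dense growth 32 -> 64:", dense[-1][2] / dense[-2][2])
```

Under these assumptions, doubling the side of the last scale multiplies the dense query-key pair count by about 16, which is the quartic dependence on resolution, while the windowed count grows only with the number of queries.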
