MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods have redefined the process as learning the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating this redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that takes only the features of the adjacent preceding scale as input for next-scale prediction, enabling a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, to reduce modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for a KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small models trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.
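
As an illustration of why restricting attention to a neighborhood of size k gives O(Nk) rather than O(N^2) cost, the following is a minimal pure-Python sketch, not the authors' implementation. The function names (`window_start`, `neighborhood_indices`, `local_attention`) and the parameter `win` are hypothetical. The sketch assumes a square `win` x `win` window (so k = win * win), an integer mapping from each query position to its corresponding position on the adjacent scale, and windows that are shifted inward at the borders; the abstract specifies none of these details.

```python
import math

def window_start(pos, size_q, size_k, win):
    """Start of a length-`win` key window for query coordinate `pos` (assumed mapping)."""
    center = pos * size_k // size_q                        # corresponding key coordinate
    return min(max(center - win // 2, 0), size_k - win)    # shift inward at the borders

def neighborhood_indices(q_hw, k_hw, win):
    """For each query on a q_hw grid, list the flat indices of its win x win key window."""
    (hq, wq), (hk, wk) = q_hw, k_hw
    assert win <= hk and win <= wk, "window must fit inside the key grid"
    idx = []
    for i in range(hq):
        r0 = window_start(i, hq, hk, win)
        for j in range(wq):
            c0 = window_start(j, wq, wk, win)
            idx.append([(r0 + a) * wk + (c0 + b)
                        for a in range(win) for b in range(win)])
    return idx

def local_attention(q, k, v, idx):
    """Softmax attention where query n only sees the keys listed in idx[n].

    q: Nq vectors of length d; k, v: Nk vectors; idx: Nq lists of key indices.
    Work is proportional to Nq * len(idx[n]) * d, i.e. O(Nk) in the abstract's notation.
    """
    d = len(q[0])
    out = []
    for qi, nb in zip(q, idx):
        scores = [sum(x * y for x, y in zip(qi, k[n])) / math.sqrt(d) for n in nb]
        m = max(scores)                                    # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wn * v[n][c] for wn, n in zip(w, nb)) / z
                    for c in range(len(v[0]))])
    return out

# Example: an 8x8 query scale attending to a 4x4 preceding scale with a 3x3 window.
idx = neighborhood_indices((8, 8), (4, 4), win=3)
q = [[0.1 * (n % 5), 0.2 * (n % 3)] for n in range(64)]
k = [[0.05 * n, 1.0 - 0.05 * n] for n in range(16)]
v = [[float(n), 1.0] for n in range(16)]
out = local_attention(q, k, v, idx)
```

Each query here touches 9 keys instead of all 16; at realistic resolutions the gap between k and N is far larger. Because every query depends only on a fixed window of the adjacent scale, nothing from earlier scales has to be retained, which is consistent with the abstract's claim that a KV cache is unnecessary at inference.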
