In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating auto-regressive (AR) visual generation. The motivation stems from the observation that images exhibit local structure: spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme along the row dimension, tokens corresponding to spatially adjacent regions along the column dimension can be decoded in parallel, enabling a ``next-set prediction'' paradigm. Because multiple tokens are decoded simultaneously in a single forward pass, the number of forward passes required to generate an image is significantly reduced, yielding a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
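To make the next-set prediction idea concrete, the following is a minimal scheduling sketch, not the paper's actual decoding rule: it assumes a 2D token grid decoded row-by-row, where a token in row `r` may be decoded once the row above is at least `window` tokens ahead (a hypothetical locality parameter standing in for the spatial-adjacency constraint). Counting the number of parallel steps illustrates how the wavefront schedule reduces forward passes relative to the `height * width` passes of strict next-token decoding.

```python
def zipar_steps(height: int, width: int, window: int) -> int:
    """Count parallel decoding steps under a toy next-set schedule.

    A token at (r, c) is decodable when row r-1 has decoded at least
    c + window tokens (rows with smaller index lead the wavefront).
    Each step decodes one token per eligible row, all in parallel.
    """
    decoded = [0] * height  # decoded[r] = tokens decoded so far in row r
    steps = 0
    while any(d < width for d in decoded):
        steps += 1
        snapshot = decoded[:]  # freeze state so rows advance in parallel
        for r in range(height):
            if snapshot[r] >= width:
                continue  # row already finished
            if r == 0 or snapshot[r - 1] >= min(width, snapshot[r] + window):
                decoded[r] += 1  # decode this row's next token
    return steps

# With window == width the schedule degenerates to sequential decoding
# (height * width steps); smaller windows shrink the step count toward
# width + (height - 1) * window.
```

Under this toy model, an 8x8 grid takes 64 sequential passes but only `8 + 7 * window` parallel steps, which mirrors (in spirit, not in exact numbers) the forward-pass reduction reported for ZipAR.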