Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, a major limitation of AR models is their sequential nature: they process tokens one at a time, which makes generation slower than with GANs or diffusion-based methods. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward pass, its application to visual AR models remains largely unexplored. In this work, we identify a challenge in this setting, which we term \textit{token selection ambiguity}, wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, we propose LANTERN, a relaxed acceptance condition that leverages the interchangeability of tokens in latent space. This relaxation restores the effectiveness of speculative decoding in visual AR models by enabling more flexible use of candidate tokens that would otherwise be prematurely rejected. Furthermore, by incorporating a total variation distance bound, we ensure that these speed gains are achieved without significantly compromising image quality or semantic coherence. Experimental results show that our method provides a substantial speed-up over standard speculative decoding. Specifically, compared to a na\"ive application of state-of-the-art speculative decoding, LANTERN increases speed-ups by $\mathbf{1.75}\times$ and $\mathbf{1.76}\times$, relative to greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model.
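To make the relaxed acceptance condition concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard speculative-decoding rule of accepting a drafted token with probability $\min(1, q/p)$, and it assumes the relaxation pools target probability over the drafted token's nearest latent-space neighbours, stopping once the borrowed mass would exceed a budget `delta` (our stand-in for the total variation distance bound). The function names, the `neighbors` table, and the toy distributions are all hypothetical.

```python
def standard_accept_prob(p_draft, q_target, token):
    """Standard speculative decoding: accept with probability min(1, q/p)."""
    return min(1.0, q_target[token] / p_draft[token])


def relaxed_accept_prob(p_draft, q_target, token, neighbors, delta):
    """Assumed relaxed rule: pool target mass over latent-space neighbours.

    neighbors[token] is assumed to list other token ids, nearest first.
    Borrowed mass (everything beyond the token's own probability) is
    capped at delta, which is how this sketch models the distance bound.
    """
    borrowed = 0.0
    for n in neighbors[token]:
        if borrowed + q_target[n] > delta:
            break
        borrowed += q_target[n]
    return min(1.0, (q_target[token] + borrowed) / p_draft[token])


# Toy example of token selection ambiguity: the target model spreads its
# probability evenly, so no single token looks likely.
q_target = [0.25, 0.25, 0.25, 0.25]
p_draft = [0.5, 0.25, 0.125, 0.125]
neighbors = {0: [1, 2, 3]}  # hypothetical nearest-neighbour table

strict = standard_accept_prob(p_draft, q_target, 0)
relaxed = relaxed_accept_prob(p_draft, q_target, 0, neighbors, delta=0.3)
print(strict, relaxed)
```

In this toy setting the strict rule accepts the drafted token only half the time, whereas pooling one neighbour's mass lifts the acceptance probability to one. Setting `delta=0.0` recovers the strict rule, which is the sense in which the budget trades speed against fidelity to the target distribution.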