Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from two factors: a significantly larger sampling space, which complicates alignment between the draft and target model outputs, and insufficient use of the two-dimensional spatial structure inherent in images, which limits the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
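To make the speculative-decoding setting concrete, the following is a minimal sketch of the standard draft-then-verify loop (in the style of rejection sampling over draft tokens), not Hawk's spatially guided variant; the function names and the toy distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, p_draft, p_target):
    """Verify a block of drafted tokens against the target model.

    Each drafted token is accepted with probability
    min(1, p_target[token] / p_draft[token]); on the first rejection,
    a replacement token is resampled from the residual distribution
    max(p_target - p_draft, 0) (renormalized), which preserves the
    target model's output distribution exactly.
    """
    accepted = []
    for tok, pd, pt in zip(draft_tokens, p_draft, p_target):
        if rng.random() < min(1.0, pt[tok] / pd[tok]):
            accepted.append(int(tok))
        else:
            residual = np.maximum(pt - pd, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break  # remaining draft tokens are discarded
    return accepted

# Toy example over a 2-token vocabulary: when draft and target
# distributions agree, every drafted token is accepted.
probs = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]
print(verify_draft([0, 1], probs, probs))  # → [0, 1]
```

The cost of one large sampling space per position is visible here: for image tokenizers with vocabularies of many thousands of codes, the per-token ratio `p_target[token] / p_draft[token]` is harder to keep close to 1, which is one reason draft-target alignment is more difficult for images than for text.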