Recently, Visual Autoregressive ($\mathsf{VAR}$) models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine ``next-scale prediction'' paradigm. Let $n$ denote the height and width of the last VQ code map generated by a $\mathsf{VAR}$ model. The state-of-the-art algorithm in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^{4+o(1)})$ time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of $\mathsf{VAR}$ models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which $\mathsf{VAR}$ computations can achieve sub-quadratic time complexity. We prove that, assuming the Strong Exponential Time Hypothesis ($\mathsf{SETH}$) from fine-grained complexity theory, a sub-quartic time algorithm for $\mathsf{VAR}$ models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of $\mathsf{VAR}$ models from a theoretical perspective. Our techniques shed light on advancing scalable and efficient image generation in $\mathsf{VAR}$ frameworks.