We investigate the fundamental limits of transformer-based foundation models, extending our analysis to Visual Autoregressive (VAR) transformers. VAR represents a significant advance in image generation, employing a novel, scalable, coarse-to-fine ``next-scale prediction'' framework. These models achieve state-of-the-art performance on image synthesis tasks, outperforming previous methods, including Diffusion Transformers. Our primary contribution establishes that VAR transformers with a single attention head, a single self-attention layer, and a single interpolation layer are universal. From a statistical perspective, we prove that such simple VAR transformers are universal approximators of arbitrary Lipschitz image-to-image functions. Furthermore, we show that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide design principles for effective and computationally efficient VAR transformers, and they extend naturally to more sophisticated VAR models in image generation and related areas.
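For concreteness, the universal-approximation claim can be read schematically as follows; the compact domain $[0,1]^{d}$, the flattening of images into $d$-dimensional vectors, the choice of norm $\lVert\cdot\rVert$, and the class symbol $\mathcal{T}_{1,1,1}$ are illustrative assumptions, not the paper's exact setting:
\[
\forall\, f\colon [0,1]^{d}\to\mathbb{R}^{d}\ \text{Lipschitz},\quad
\forall\, \epsilon > 0,\quad
\exists\, T \in \mathcal{T}_{1,1,1}\ \text{such that}\quad
\sup_{x \in [0,1]^{d}} \bigl\lVert T(x) - f(x) \bigr\rVert \le \epsilon,
\]
where $\mathcal{T}_{1,1,1}$ denotes the class of single-head VAR transformers with one self-attention layer and one interpolation layer.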