Autoregressive models have demonstrated strong performance in natural language processing (NLP), with impressive scalability, adaptability, and generalizability. Inspired by this success, autoregressive models have recently been investigated intensively for computer vision: they represent visual data as visual tokens and perform next-token prediction, which enables autoregressive modelling for a wide range of vision tasks, from visual generation and visual understanding to the very recent multimodal generation that unifies visual generation and understanding within a single autoregressive model. This paper provides a systematic review of vision autoregressive models. We develop a taxonomy of existing methods and highlight their major contributions, strengths, and limitations across vision tasks including image generation, video generation, image editing, motion generation, medical image analysis, 3D generation, robotic manipulation, and unified multimodal generation. In addition, we analyze the latest advancements in autoregressive models, with thorough benchmarking and discussion of existing methods across various evaluation datasets. Finally, we outline key challenges and promising directions for future research, offering a roadmap to guide further advancements in vision autoregressive models.