Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks under the next-token-prediction paradigm. Recently, there has been growing interest in extending this success to vision foundation models. In this survey, we review recent advances and discuss future directions for autoregressive vision foundation models. First, we present the trend for the next generation of vision foundation models, i.e., unifying understanding and generation in vision tasks. We then analyze the limitations of existing vision foundation models, and present a formal definition of autoregression along with its advantages. Next, we categorize autoregressive vision foundation models according to their vision tokenizers and autoregression backbones. Finally, we discuss several promising research challenges and directions. To the best of our knowledge, this is the first survey to comprehensively summarize autoregressive vision foundation models under the trend of unifying understanding and generation. A collection of related resources is available at https://github.com/EmmaSRH/ARVFM.
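As a brief sketch of the next-token-prediction paradigm referenced above (the survey's exact formal definition may differ), autoregression factorizes the joint distribution of a token sequence $x = (x_1, \dots, x_T)$ into a product of conditionals, each predicting the next token from all preceding ones:

$$
p(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
$$

For vision, the same factorization applies once an image is mapped by a tokenizer into a discrete token sequence, which is what allows understanding and generation to share a single backbone.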