Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, which makes the visual recognition paradigm laborious and time-consuming. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. A VLM learns rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and a single VLM enables zero-shot predictions on various visual recognition tasks. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) a review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.