Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, along with the ambiguities and variations of real-world environments, can be better described in human language, which is naturally governed by grammatical rules, and in other modalities such as audio and depth. Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are referred to as foundational models. Their output can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs for combining different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We systematically review recent developments in this field, covering a wide range of applications of foundational models. A comprehensive list of the foundational models studied in this work is available at \url{https://github.com/awaisrauf/Awesome-CV-Foundational-Models}.