Foundational Models Defining a New Era in Vision: A Survey and Outlook

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, which is naturally governed by grammatical rules, and in other modalities such as audio and depth. Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are referred to as foundational models. Their output can be modified through human-provided prompts without retraining, e.g., segmenting a particular object given a bounding box, holding interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundational models systematically and comprehensively. A comprehensive list of the foundational models studied in this work is available at https://github.com/awaisrauf/Awesome-CV-Foundational-Models.

