Can Large Pre-trained Models Help Vision Models on Perception Tasks?

The recent upsurge in pre-trained large models (e.g. GPT-4) has swept across the entire deep learning community. Such powerful large language models (LLMs) demonstrate advanced generative ability and multimodal understanding, and have quickly achieved new state-of-the-art performance on a variety of benchmarks. A pre-trained LLM usually plays the role of a universal AI model that can carry out various tasks, including context reasoning, article analysis and image content comprehension. However, given the prohibitively high memory and computational cost of deploying such a large model, conventional models (such as CNNs and ViTs) are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g. image classification) by taking advantage of large pre-trained models. We present a new learning paradigm in which the knowledge extracted from large pre-trained models is used to help models like CNNs and ViTs learn enhanced representations and achieve better performance. First, we curate a high-quality description set by prompting a multimodal LLM to generate descriptive text for all training images. Next, we feed these detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of the images. During training, the text embeddings serve as extra supervisory signals and are aligned with the image representations learned by the vision models. This alignment helps vision models learn better representations and achieve higher accuracy with the assistance of pre-trained LLMs. We conduct extensive experiments to verify that the proposed algorithm consistently improves the performance of various vision models with heterogeneous architectures.
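The training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the alignment loss, so everything specific below is an assumption made for illustration: the cosine-distance alignment term, the linear projection that maps image features into the text-embedding space, the weight `align_weight`, and all function names. The text embedding is treated as precomputed and fixed, since it comes from a pre-trained encoder.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one sample, computed from raw class logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(features, weight):
    """Linear map into the text-embedding space (one weight row per output dim)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def training_loss(logits, label, image_features, text_embedding,
                  proj_weight, align_weight=0.5):
    """Task loss plus an alignment term (assumed form, not the paper's).

    The alignment term is 1 - cos(projected image features, text embedding),
    so it is 0 when the two vectors point in the same direction.
    Returns (total, task, align).
    """
    task = softmax_cross_entropy(logits, label)
    projected = project(image_features, proj_weight)
    align = 1.0 - cosine_similarity(projected, text_embedding)
    return task + align_weight * align, task, align


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
    total, task, align = training_loss(
        logits=[2.0, 0.5], label=0,
        image_features=[0.8, 0.1], text_embedding=[1.0, 0.0],
        proj_weight=identity,
    )
    print(f"task={task:.4f} align={align:.4f} total={total:.4f}")
```

In a real setup the vision backbone and the projection would be trained by gradient descent on the total loss, while the text embeddings stay frozen. Because the text branch is only needed to compute the loss, it adds no cost at inference time under this sketch.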

