The recent surge of pre-trained large models (e.g., GPT-4) has swept across the entire deep learning community. Such powerful large language models (LLMs) demonstrate advanced generative ability and multimodal understanding, and have quickly achieved new state-of-the-art performance on a variety of benchmarks. A pre-trained LLM usually plays the role of a universal AI model that can perform various tasks, including contextual reasoning, article analysis, and image content comprehension. However, given the prohibitively high memory and computational cost of deploying such large models, conventional models (such as CNNs and ViTs) remain essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g., image classification) by taking advantage of large pre-trained models. We present a new learning paradigm in which knowledge extracted from large pre-trained models is used to help models such as CNNs and ViTs learn enhanced representations and achieve better performance. First, we curate a high-quality description set by prompting a multimodal LLM to generate descriptive text for every training image. Next, we feed these detailed descriptions into a pre-trained encoder to extract text embeddings whose rich semantic information encodes the content of the images. During training, the text embeddings serve as extra supervisory signals and are aligned with the image representations learned by the vision model. This alignment helps vision models learn better representations and achieve higher accuracy with the assistance of pre-trained LLMs. We conduct extensive experiments to verify that the proposed algorithm consistently improves the performance of various vision models with heterogeneous architectures.
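To make the training signal described above concrete, the following is a minimal sketch of how a classification loss might be combined with an alignment term between an image representation and the text embedding of its description. The abstract does not specify the exact objective, so everything here is an assumption made for illustration: the cosine-distance alignment term, the linear projection `W` from image-feature space to text-embedding space, the weight `align_weight`, and all function names are hypothetical rather than the authors' formulation.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for a single example, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity of two equal-length vectors (eps avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def project(W, x):
    """Hypothetical linear projection from image-feature space to text space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def training_loss(logits, label, image_feat, text_emb, W, align_weight=0.5):
    """Classification loss plus an assumed cosine-distance alignment term.

    text_emb stands in for the embedding of the LLM-generated description,
    produced offline by a frozen pre-trained text encoder.
    """
    cls_loss = softmax_cross_entropy(logits, label)
    align_loss = 1.0 - cosine_similarity(project(W, image_feat), text_emb)
    return cls_loss + align_weight * align_loss
```

In this sketch the text embeddings are treated as fixed targets, so only the vision model and the projection would receive gradients in a real training loop. At inference time the alignment term is dropped, and the vision model runs on its own without the LLM or the text encoder.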