The recent surge of large pre-trained models (e.g., GPT-4) has swept across the deep learning community. These large language models (LLMs) demonstrate advanced generative ability and multimodal understanding, and have quickly set new state-of-the-art results on a variety of benchmarks. A pre-trained LLM usually plays the role of a universal AI model that can perform various tasks, including contextual reasoning, article analysis, and image content comprehension. However, given the prohibitively high memory and computational cost of deploying such a large model, conventional models (such as CNNs and ViTs) remain essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g., image classification) by taking advantage of large pre-trained models. We present a new learning paradigm in which knowledge extracted from large pre-trained models is used to help models such as CNNs and ViTs learn enhanced representations and achieve better performance. First, we curate a high-quality description set by prompting a multimodal LLM to generate descriptive text for every training image. Next, we feed these detailed descriptions into a pre-trained encoder to extract text embeddings whose rich semantic information encodes the content of the images. During training, the text embeddings serve as extra supervisory signals and are aligned with the image representations learned by the vision models. This alignment helps vision models learn better representations and achieve higher accuracy with the assistance of pre-trained LLMs. We conduct extensive experiments to verify that the proposed algorithm consistently improves the performance of various vision models with heterogeneous architectures.
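The training objective described in the abstract, a task loss plus an alignment term between image representations and text embeddings, could be sketched roughly as follows. The abstract does not give the form of the alignment loss, so everything specific here is an assumption made for illustration: the cosine-distance alignment term, the linear projection `proj_weight` that maps image features into the text-embedding space, the weighting factor `align_weight`, and all function names. The sketch uses plain Python for one sample; a real implementation would use batched tensor operations in a deep learning framework.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one sample, given raw class logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def linear_project(features, weight):
    """Map image features into the text-embedding space.

    `weight` is a (text_dim x image_dim) matrix given as a list of rows.
    This projection layer is a hypothetical component, not taken from the paper.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def training_loss(logits, label, image_features, text_embedding,
                  proj_weight, align_weight=0.5):
    """Task loss plus an assumed cosine-distance alignment term.

    Returns (total, task, align) so each part can be inspected.
    """
    task = softmax_cross_entropy(logits, label)
    projected = linear_project(image_features, proj_weight)
    align = 1.0 - cosine_similarity(projected, text_embedding)
    return task + align_weight * align, task, align


# Toy example: a 2-d image feature projected into a 3-d text-embedding space.
logits = [2.0, 0.5, -1.0]
image_features = [1.0, 0.0]
proj_weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
text_embedding = [1.0, 0.0, 0.0]  # stands in for a frozen text-encoder output

total, task, align = training_loss(
    logits, 0, image_features, text_embedding, proj_weight
)
print(f"task={task:.4f} align={align:.4f} total={total:.4f}")
```

In this sketch the text embedding is treated as a fixed target, consistent with the abstract's description of a pre-trained encoder supplying extra supervision, so only the vision model (and the assumed projection) would receive gradients. At inference time the alignment term is dropped and the vision model runs on its own.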