In this paper we address image classification tasks by leveraging the knowledge encoded in Large Multimodal Models (LMMs). More specifically, we use the MiniGPT-4 model to extract semantic descriptions of the images in a multimodal prompting fashion. In the current literature, vision-language models such as CLIP, among other approaches, are utilized as feature extractors for image classification, using only the image encoder. In this paper, we propose to additionally use the text encoder to obtain the text embeddings corresponding to the MiniGPT-4-generated semantic descriptions, and thus to use both the image and text embeddings for solving the image classification task. The experimental evaluation on three datasets validates the improved classification performance achieved by exploiting LMM-based knowledge.
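To illustrate the idea, the following is a minimal sketch of the fusion step, assuming a CLIP ViT-B/32 backbone loaded through the HuggingFace transformers library and a plain linear classification head; the variable `description` stands for the MiniGPT-4-generated semantic description of the image, and the checkpoint name, fusion by concatenation, and classifier are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pre-trained CLIP backbone (checkpoint choice is an assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def extract_features(image: Image.Image, description: str) -> torch.Tensor:
    """Fuse CLIP image and text embeddings.

    `description` is assumed to be the semantic description obtained by
    prompting MiniGPT-4 on the same image.
    """
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # L2-normalise each modality, then concatenate into one feature vector.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return torch.cat([image_emb, text_emb], dim=-1).squeeze(0)

# Hypothetical linear head over the concatenated image+text embedding.
num_classes = 10
classifier = torch.nn.Linear(2 * model.config.projection_dim, num_classes)
```

In this sketch the two modalities are simply concatenated before classification; other fusion schemes (e.g., averaging or late fusion of per-modality classifiers) are equally possible under the same setup.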