Recently, pre-trained large language models (LLMs) such as GPT-4 have surged in popularity, sweeping across the Natural Language Processing (NLP) and Computer Vision (CV) communities. These LLMs have demonstrated advanced multi-modal understanding and strong performance across a variety of benchmarks. LLMs have begun to exhibit traits of artificial general intelligence, which offers important guidance for enhancing the brain-like characteristics of visual encoding models. Hence, this paper proposes a new multi-modal training paradigm, aligned with an LLM, for encoding fMRI activity in the visual cortex. Based on this paradigm, we train an encoding model on fMRI data, named the LLM-Visual Encoding Model (LLM-VEM). Specifically, we first use an LLM (miniGPT4) to generate descriptive text for all stimulus images, forming a high-quality set of textual descriptions. We then process these detailed descriptions with a pre-trained text encoder (CLIP) to obtain text embedding features. Next, we use a contrastive loss function to minimize the distance between the image embedding features and the text embedding features, which aligns each stimulus image with its text information. With the assistance of the pre-trained LLM, this alignment helps the visual encoding model learn better, resulting in higher encoding precision. The experimental results indicate that our training paradigm significantly improves the performance of the visual encoding model.
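The alignment step described above can be illustrated with a minimal sketch. The abstract does not specify the exact form of the contrastive loss, so the code below assumes a symmetric, CLIP-style InfoNCE objective over a batch of paired image and text embeddings. The function names, the temperature value of 0.07, and the random toy embeddings are all hypothetical and stand in for the outputs of the image branch and the CLIP text encoder; this is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    # Assumed form: symmetric InfoNCE. Row i of image_emb and row i of
    # text_emb are treated as a matching (stimulus image, description) pair;
    # every other row in the batch serves as a negative.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data (hypothetical): 8 pairs of 16-dimensional embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))
aligned = text_emb + 0.01 * rng.normal(size=(8, 16))  # images close to their text
shuffled = np.roll(text_emb, 1, axis=0)               # images paired with wrong text

loss_aligned = contrastive_alignment_loss(aligned, text_emb)
loss_shuffled = contrastive_alignment_loss(shuffled, text_emb)
print(f"aligned pairs: {loss_aligned:.4f}, mismatched pairs: {loss_shuffled:.4f}")
```

Under these assumptions, the loss is lower when each image embedding lies close to the embedding of its own description than when the pairs are mismatched, which is the behavior the alignment step relies on. In a training setting, a term of this kind would typically be combined with the fMRI response prediction loss, although the abstract does not state how the two are weighted.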