Multimodal large language models (MLLMs) have achieved satisfactory results on many tasks. However, their performance on the task of person re-identification (ReID) has not been explored to date. This paper investigates how to adapt them to ReID. An intuitive approach is to fine-tune an MLLM on ReID image-text datasets and then use its visual encoder as a backbone for ReID. However, two apparent issues remain: (1) When designing instructions for ReID, MLLMs may overfit to specific instructions, while designing a diverse set of instructions incurs higher costs. (2) The latent image feature vectors produced by the LLM are not involved in loss computation. Instructional learning, which aligns image and text features, optimizes them only indirectly, and the resulting learning objective underutilizes these features, limiting their effectiveness for person feature learning. To address these problems, this paper proposes MLLMReID: Multimodal Large Language Model-based ReID. First, we propose Common Instruction, a simple approach that leverages the LLM's essential ability to continue text, avoiding complex and diverse instruction design. Second, we propose DirectReID, which directly employs the latent image feature vectors output by the LLM in the ReID task. Experimental results demonstrate the superiority of our method. We will open-source the code on GitHub.
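To make the DirectReID idea concrete, below is a minimal PyTorch sketch of how latent image feature vectors taken from an LLM's hidden states could be fed directly into standard ReID losses (identity classification plus batch-hard triplet loss). This is an illustrative assumption, not the paper's exact design: the names DirectReIDHead and batch_hard_triplet_loss are hypothetical, the BNNeck layer and margin value are common ReID conventions, and the precise point at which the latent features are extracted from the MLLM is left unspecified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def batch_hard_triplet_loss(feats: torch.Tensor, pids: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """Batch-hard triplet loss, a standard ReID objective (assumed here)."""
    # Pairwise Euclidean distances between all features in the batch.
    dist = torch.cdist(feats, feats, p=2)
    same = pids.unsqueeze(0) == pids.unsqueeze(1)
    # Hardest positive: farthest sample sharing the anchor's identity.
    hard_pos = (dist * same.float()).max(dim=1).values
    # Hardest negative: closest sample with a different identity
    # (assumes each identity in the batch has at least one negative).
    inf = torch.full_like(dist, float("inf"))
    hard_neg = torch.where(same, inf, dist).min(dim=1).values
    return F.relu(hard_pos - hard_neg + margin).mean()


class DirectReIDHead(nn.Module):
    """Hypothetical head: applies ReID losses directly to the latent
    image feature vectors produced by the MLLM, rather than optimizing
    only the indirect instruction-following (text) objective."""

    def __init__(self, feat_dim: int, num_ids: int):
        super().__init__()
        self.bnneck = nn.BatchNorm1d(feat_dim)          # BNNeck convention
        self.classifier = nn.Linear(feat_dim, num_ids, bias=False)

    def forward(self, latent_feats: torch.Tensor,
                pids: torch.Tensor) -> torch.Tensor:
        # latent_feats: (B, D) image-token features pulled from the LLM's
        # hidden states; the extraction point is an assumption.
        feats = self.bnneck(latent_feats)
        logits = self.classifier(feats)
        id_loss = F.cross_entropy(logits, pids)
        tri_loss = batch_hard_triplet_loss(latent_feats, pids)
        return id_loss + tri_loss
```

Under this sketch, the latent features participate in loss computation directly, so gradients from the ReID objective flow back into the visual encoder instead of arriving only through image-text alignment.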