LVLMs have been shown to perform excellently in image-level tasks such as visual question answering (VQA) and image captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still lag behind earlier expert models. Meanwhile, although pedestrian tracking is a classical task, a number of new topics combining object tracking with natural language have emerged, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks require models to understand tracked objects at a high semantic level, which is exactly where LVLMs excel. In this paper, we propose OmniPT, a new unified pedestrian tracking framework that can track, perform reference-based tracking, and interactively generate semantic descriptions of tracked objects. We address two issues: how to formulate the tracking task so that a foundation model can perform it, and how to make the model produce answers in a fixed format. To this end, we adopt a training pipeline consisting of RL, mid-training, SFT, and a final RL stage. Starting from the pre-trained weights of an LVLM, we first perform a simple RL phase to teach the model to output bounding boxes in a fixed, supervisable format. We then conduct a mid-training phase on a large collection of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, followed by another RL phase to improve the model's tracking performance and strengthen its instruction-following ability. Experiments on tracking benchmarks demonstrate that the proposed method outperforms previous approaches.