LVLMs have been shown to perform excellently in image-level tasks such as visual question answering (VQA) and image captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still lag behind earlier expert models. Meanwhile, although pedestrian tracking is a classical task, a number of new topics combining object tracking with natural language have emerged, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks require models to understand tracked objects at a high semantic level, which is exactly where LVLMs excel. In this paper, we propose OmniPT, a new unified pedestrian tracking framework that can track, perform reference-based tracking, and interactively generate semantic descriptions of tracked objects. We address two issues: how to formulate the tracking task so that a foundation model can perform it, and how to make the model produce answers in a fixed format. To this end, we adopt a training pipeline consisting of RL, mid-training, SFT, and a final RL stage. Starting from the pre-trained weights of an LVLM, we first perform a simple RL phase to teach the model to output bounding boxes in a fixed, supervisable format. We then conduct a mid-training phase on a large collection of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, followed by another RL phase to improve the model's tracking performance and strengthen its instruction-following ability. Experiments on tracking benchmarks demonstrate that the proposed method outperforms previous approaches.