Tracking with Human-Intent Reasoning

Advances in perception modeling have significantly improved the performance of object tracking. However, current methods specify the target object in the initial frame either by 1) a box or mask template, or by 2) an explicit language description. Both approaches are cumbersome and leave the tracker without any reasoning ability of its own. This work therefore proposes a new tracking task, Instruction Tracking, in which implicit tracking instructions are provided and the tracker must perform tracking automatically in video frames. To this end, we investigate integrating the knowledge and reasoning capabilities of a Large Vision-Language Model (LVLM) into object tracking. Specifically, we propose a tracker called TrackGPT, which is capable of performing complex reasoning-based tracking. TrackGPT first uses the LVLM to understand the tracking instruction and condense the cues about which target to track into referring embeddings. The perception component then generates the tracking results from these embeddings. To evaluate TrackGPT, we construct an instruction tracking benchmark called InsTrack, which contains over one thousand instruction-video pairs for instruction tuning and evaluation. Experiments show that TrackGPT achieves competitive performance on referring video object segmentation benchmarks, for example setting a new state-of-the-art of 66.5 $\mathcal{J}\&\mathcal{F}$ on Refer-DAVIS. It also demonstrates superior instruction tracking performance under new evaluation protocols. The code and models are available at \href{https://github.com/jiawen-zhu/TrackGPT}{https://github.com/jiawen-zhu/TrackGPT}.
