Improving Visual Object Tracking through Visual Prompting

Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Adapting the target representation dynamically in the presence of distractors is challenging because prevailing trackers have limited discriminative capability. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT introduces a prompt generation network that uses the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge to tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT then leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once refined, the visual prompt better highlights potential target locations and carries less irrelevant information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps under the guidance of the visual prompt, thus effectively suppressing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pre-trained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method, can suppress distracting objects and enhance the tracker.
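The refinement step described above can be illustrated with a minimal sketch. It is not the PiVOT implementation: it assumes the visual prompt is a 2D score map, that each candidate location already has an embedding (a stand-in for a CLIP image feature), and that refinement simply rescales each candidate's score by its best cosine similarity to any reference template. The names `cosine` and `refine_prompt`, the clipping to [0, 1], and the toy 2D embeddings are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def refine_prompt(prompt, candidates, cand_feats, template_feats):
    """Rescale candidate scores in a prompt map by template similarity.

    prompt         : 2D list of scores (the assumed form of the visual prompt)
    candidates     : list of (row, col) candidate target locations
    cand_feats     : one embedding per candidate (stand-in for CLIP features)
    template_feats : embeddings of the reference templates
    """
    refined = [row[:] for row in prompt]  # leave the input map untouched
    for (r, c), feat in zip(candidates, cand_feats):
        # Best match of this candidate against any reference template.
        best = max(cosine(feat, t) for t in template_feats)
        # Assumption: clip to [0, 1] so the similarity acts as a soft mask.
        weight = min(max(best, 0.0), 1.0)
        refined[r][c] = prompt[r][c] * weight
    return refined


# Toy example: two equally scored candidates, one matching the template.
prompt = [[0.0] * 5 for _ in range(5)]
prompt[1][1] = 0.9  # true target
prompt[3][3] = 0.9  # distractor with the same initial score
candidates = [(1, 1), (3, 3)]
cand_feats = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical embeddings
template_feats = [[1.0, 0.0]]
refined = refine_prompt(prompt, candidates, cand_feats, template_feats)
```

In this toy case the candidate whose embedding matches the template keeps its score, while the orthogonal distractor is suppressed to zero. A real system would obtain the embeddings from CLIP's image encoder and would likely use a softer weighting.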
