We introduce Diff-Tracker, a novel approach to the challenging task of unsupervised visual tracking that leverages a pre-trained text-to-image diffusion model. Our main idea is to exploit the rich knowledge encapsulated in the pre-trained diffusion model, such as its understanding of image semantics and structural information, to address unsupervised visual tracking. To this end, we design an initial prompt learner that enables the diffusion model to recognize the tracking target by learning a prompt representing the target. Furthermore, to allow the prompt to adapt dynamically to the target's movements, we propose an online prompt updater. Extensive experiments on five benchmark datasets demonstrate the effectiveness of the proposed method, which also achieves state-of-the-art performance.