Trackers and video generators solve closely related problems: the former analyze motion, while the latter synthesize it. We show that this connection enables pretrained video diffusion models to perform zero-shot point tracking by simply prompting them to visually mark points as they move over time. We place a distinctively colored marker at the query point, then regenerate the rest of the video from an intermediate noise level. This propagates the marker across frames, tracing the point's trajectory. To ensure that the marker remains visible in this counterfactual generation, despite such markers being unlikely in natural videos, we use the unedited initial frame as a negative prompt. Through experiments with multiple image-conditioned video diffusion models (such as Stable Video Diffusion), we find that these "emergent" tracks outperform those of prior zero-shot methods and persist through occlusions, often obtaining performance that is competitive with specialized self-supervised models.
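The pipeline above (mark the query point, partially noise the remaining frames, denoise with the unedited first frame as a negative prompt, then read out the marker per frame) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoiser` is a hypothetical stand-in for a pretrained image-conditioned video diffusion model, and the linear noising schedule, guidance form, and red-channel readout are simplifying assumptions.

```python
import numpy as np

def track_by_marker_propagation(video, query, denoiser,
                                noise_level=0.5, guidance=2.0, seed=0):
    """Zero-shot point tracking via marker propagation (illustrative sketch).

    video: (T, H, W, 3) float array in [0, 1].
    query: (y, x) point to track in frame 0.
    denoiser: hypothetical stand-in for a pretrained image-conditioned
        video diffusion model; takes a noisy video and a conditioning
        frame, returns a denoised video of the same shape.
    Returns a (T, 2) array of (y, x) positions.
    """
    rng = np.random.default_rng(seed)
    T, H, W, _ = video.shape

    # 1. Paint a distinctively colored marker at the query point in frame 0.
    edited = video.copy()
    y, x = query
    edited[0, y, x] = np.array([1.0, 0.0, 0.0])  # pure red marker

    # 2. Diffuse the remaining frames to an intermediate noise level
    #    (SDEdit-style partial noising); the edited first frame stays clean.
    noisy = edited.copy()
    noisy[1:] = ((1.0 - noise_level) * edited[1:]
                 + noise_level * rng.normal(size=(T - 1, H, W, 3)))

    # 3. Regenerate with guidance: conditioning on the UNEDITED first frame
    #    acts as the negative prompt, steering generation toward videos in
    #    which the (otherwise unlikely) marker persists.
    pos = denoiser(noisy, cond=edited[0])
    neg = denoiser(noisy, cond=video[0])
    regen = neg + guidance * (pos - neg)

    # 4. Read out the trajectory as the "reddest" pixel in each frame.
    redness = regen[..., 0] - regen[..., 1:].mean(axis=-1)
    flat = redness.reshape(T, -1).argmax(axis=-1)
    return np.stack([flat // W, flat % W], axis=-1)
```

In practice the denoiser would be a real video diffusion model (e.g. run for several denoising steps from the chosen noise level), and the readout would need to be more robust than a single argmax, but the control flow mirrors the method described above.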