Trackers and video generators solve closely related problems: the former analyze motion, while the latter synthesize it. We show that this connection enables pretrained video diffusion models to perform zero-shot point tracking by simply prompting them to visually mark points as they move over time. We place a distinctively colored marker at the query point, then regenerate the rest of the video from an intermediate noise level. This propagates the marker across frames, tracing the point's trajectory. To ensure that the marker remains visible in this counterfactual generation, despite such markers being unlikely in natural videos, we use the unedited initial frame as a negative prompt. Through experiments with multiple image-conditioned video diffusion models, we find that these "emergent" tracks outperform those of prior zero-shot methods and persist through occlusions, often obtaining performance that is competitive with specialized self-supervised models.
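The steps described above (mark the query point, re-noise the later frames to an intermediate level, steer generation away from the unedited first frame, then read the marker's position back out) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the pure-red marker, the function names, the single `alpha_bar` noise level, the classifier-free-guidance-style combination rule, and the color-threshold readout. The video diffusion model itself is not included; `pred_edited` and `pred_unedited` stand in for its outputs under the edited and unedited first-frame conditioning.

```python
import numpy as np

MARKER_RGB = np.array([1.0, 0.0, 0.0])  # assumed marker color (pure red), values in [0, 1]


def place_marker(frame, xy, radius=3, color=MARKER_RGB):
    """Return a copy of an (H, W, 3) float frame with a filled disc at xy = (x, y)."""
    h, w, _ = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    mask = (xs - xy[0]) ** 2 + (ys - xy[1]) ** 2 <= radius ** 2
    out = frame.copy()
    out[mask] = color
    return out


def noise_later_frames(video, alpha_bar, rng):
    """Noise frames 1..T-1 of a (T, H, W, 3) video to an intermediate level.

    Frame 0 (the edited conditioning frame) is left clean. The mix
    sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps is the usual forward
    diffusion step; the right noise level for a given model is an assumption.
    """
    noisy = video.copy()
    eps = rng.standard_normal(video[1:].shape)
    noisy[1:] = np.sqrt(alpha_bar) * video[1:] + np.sqrt(1.0 - alpha_bar) * eps
    return noisy


def guided_prediction(pred_edited, pred_unedited, scale):
    """Extrapolate away from the negative (unedited-frame) branch.

    This is the standard classifier-free-guidance form, assumed here as one
    way to use the unedited first frame as a negative prompt.
    """
    return pred_unedited + scale * (pred_edited - pred_unedited)


def locate_marker(frame, color=MARKER_RGB, tol=0.3):
    """Centroid (x, y) of pixels within tol of the marker color, or None if absent."""
    dist = np.linalg.norm(frame - color, axis=-1)
    ys, xs = np.nonzero(dist < tol)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())
```

In a full pipeline, `guided_prediction` would be called at every denoising step on the two model outputs, and `locate_marker` would be applied to each generated frame to obtain the track; returning `None` is one simple way to flag frames where the marker is not visible.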