We introduce CoTracker, a transformer-based model that jointly tracks a dense set of points in a frame across a video sequence. This differs from most existing state-of-the-art approaches, which track points independently and ignore their correlation. We show that joint tracking yields significantly higher tracking accuracy and robustness. We also contribute several technical innovations, including the concept of virtual tracks, which allows CoTracker to track 70k points jointly and simultaneously. Furthermore, CoTracker operates causally on short windows, making it suitable for online tasks, but is trained by unrolling the windows across longer video sequences, which enables and significantly improves long-term tracking. Qualitatively, we demonstrate impressive tracking results in which points are tracked for a long time even when they are occluded or leave the field of view. Quantitatively, CoTracker outperforms all recent trackers on standard benchmarks, often by a substantial margin.