We introduce CoTracker, a transformer-based model that tracks a large number of 2D points in long video sequences. Unlike most existing approaches, which track points independently, CoTracker tracks them jointly, accounting for their dependencies. We show that joint tracking significantly improves accuracy and robustness, and allows CoTracker to track occluded points and points that leave the camera view. We also introduce several innovations for this class of trackers, including token proxies, which significantly improve memory efficiency and allow CoTracker to track 70k points jointly at inference time on a single GPU. CoTracker is an online algorithm that operates causally on short windows; however, it is trained by unrolling windows, like a recurrent network, so it maintains tracks over long periods even when points are occluded or leave the field of view. Quantitatively, CoTracker substantially outperforms prior trackers on standard point-tracking benchmarks.
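The causal sliding-window scheme described above can be sketched as follows. This is not the authors' code: `track_window`, the window length `W`, and the half-window stride are illustrative assumptions. The point is only how each window is initialized from the previous window's estimates so that tracks persist across a long video, including through occlusions.

```python
W = 8            # assumed window length, for illustration only
STRIDE = W // 2  # consecutive windows overlap by half

def track_video(frames, init_points, track_window):
    """Run a per-window tracker causally over `frames`.

    `track_window(chunk, init)` is a hypothetical callable standing in for
    the transformer: given a chunk of frames and an initial track estimate
    per frame, it returns a refined estimate per frame. Here we only show
    how its outputs are chained from one window to the next.
    """
    tracks = [init_points]  # one point-set estimate per committed frame
    t = 0
    while t < len(frames):
        chunk = frames[t : t + W]
        # seed every frame of the window with the latest committed estimate
        init = [tracks[-1]] * len(chunk)
        refined = track_window(chunk, init)
        if t == 0:
            tracks = list(refined)
        else:
            # keep the already-committed overlap, append the new half
            tracks.extend(refined[W - STRIDE:])
        t += STRIDE
    return tracks[: len(frames)]
```

Because each window is seeded from the previous one, the model can carry a point's state forward even while it is occluded, and training on unrolled sequences of such windows (as the abstract notes) teaches it to do so over horizons longer than a single window.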