To endow models with a greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires an algorithm to track any point corresponding to a solid surface in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is currently available only in simulation, which offers a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 66.4%, and TAP-Vid-Kinetics from 57.2% to 61.5%.
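As a rough illustration of what a self-supervised student-teacher setup for point tracking might look like, the toy sketch below computes a loss between a student's predicted tracks and a teacher's pseudo-label tracks, then nudges the teacher toward the student. The abstract does not specify these details: the exponential-moving-average (EMA) teacher update, the Huber loss, and the names `huber`, `pseudo_label_loss`, and `ema_update` are all assumptions made for illustration, and the track coordinates are made-up values rather than results.

```python
# Toy sketch of a student-teacher update on unlabeled data (pure Python).
# Assumptions (not stated in the abstract): the teacher is an exponential
# moving average (EMA) of the student, and the loss is a Huber distance
# between student and teacher point tracks.


def huber(residual, delta=1.0):
    """Huber penalty for a single scalar residual."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def pseudo_label_loss(student_tracks, teacher_tracks, delta=1.0):
    """Mean per-frame Huber distance between two lists of (x, y) positions."""
    total = 0.0
    for (sx, sy), (tx, ty) in zip(student_tracks, teacher_tracks):
        total += huber(sx - tx, delta) + huber(sy - ty, delta)
    return total / len(student_tracks)


def ema_update(teacher_params, student_params, decay=0.99):
    """Move each teacher parameter a small step toward the student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


# Hypothetical tracks for one query point over three frames.
teacher_tracks = [(10.0, 20.0), (11.0, 21.0), (12.0, 22.0)]  # pseudo-labels
student_tracks = [(10.5, 20.0), (11.0, 23.0), (12.0, 22.0)]  # predictions

loss = pseudo_label_loss(student_tracks, teacher_tracks)

# Hypothetical two-parameter models, only to show the update rule.
new_teacher = ema_update([1.0, 2.0], [2.0, 4.0], decay=0.9)
```

In a real system the student would typically see an augmented version of the video while the teacher sees the clean one, and the loss would be backpropagated through the student only; those steps are omitted here to keep the sketch self-contained.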