Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful for inferring 3D shape, physical properties, and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation has existed until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline that uses optical flow estimates to compensate for easier, short-term motion such as camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point-tracking model, TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.