TAP-Vid: A Benchmark for Tracking Any Point in a Video

Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
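To make the role of optical flow concrete, here is a minimal sketch of how per-frame flow estimates can carry a point forward through a clip. It is an illustration only and not the annotation pipeline or the TAP-Net model described in the paper. The function name `propagate_point`, the flow array layout, and the toy data are all assumptions made for this example.

```python
import numpy as np

def propagate_point(flows, start_xy):
    """Chain per-frame optical flow to carry one point through a clip.

    Hypothetical helper for illustration; not taken from the paper.
    flows: array of shape (T-1, H, W, 2); flows[t, y, x] is the assumed
           (dx, dy) displacement of pixel (x, y) from frame t to frame t+1.
    start_xy: (x, y) position of the point in frame 0.
    Returns an array of shape (T, 2) with the estimated position per frame.
    """
    flows = np.asarray(flows, dtype=np.float64)
    num_steps, height, width, _ = flows.shape
    track = np.zeros((num_steps + 1, 2), dtype=np.float64)
    track[0] = start_xy
    for t in range(num_steps):
        x, y = track[t]
        # Nearest-neighbour lookup of the flow, clamped to the image bounds.
        xi = int(np.clip(np.rint(x), 0, width - 1))
        yi = int(np.clip(np.rint(y), 0, height - 1))
        track[t + 1] = track[t] + flows[t, yi, xi]
    return track

# Toy clip: every pixel shifts 1 px right and 0.5 px down per frame.
flows = np.zeros((3, 8, 8, 2))
flows[..., 0] = 1.0
flows[..., 1] = 0.5
track = propagate_point(flows, (2.0, 3.0))
print(track)
```

Chaining flow like this accumulates error over long clips and does not handle occlusion. That limitation is consistent with the abstract's framing: flow takes care of easy, short-term motion, while annotators handle the harder sections.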

