Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion. Project page: https://trace-anything.github.io/.
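The trajectory parameterization described above (a set of predicted control points defining a B-spline, evaluated at arbitrary query times) can be sketched as follows. This is a minimal illustration of B-spline evaluation via De Boor's algorithm; the degree, number of control points, and clamped uniform knot layout are illustrative assumptions, not the model's exact configuration.

```python
import numpy as np

def bspline_point(control_points, t, degree=3):
    """Evaluate a clamped uniform B-spline trajectory at time t in [0, 1].

    control_points: (K, 3) array of 3D control points (as predicted per
    pixel by the model); returns the 3D position at query time t.
    Uses De Boor's algorithm. Degree/knot choices here are assumptions.
    """
    c = np.asarray(control_points, dtype=float)
    k, p = len(c), degree
    # Clamped uniform knot vector: p+1 zeros, evenly spaced interior
    # knots, p+1 ones (K + p + 1 knots in total).
    n_interior = k - p - 1
    knots = np.concatenate([
        np.zeros(p + 1),
        np.arange(1, n_interior + 1) / (n_interior + 1),
        np.ones(p + 1),
    ])
    # Locate the knot span i such that knots[i] <= t < knots[i + 1].
    i = int(np.searchsorted(knots, t, side="right")) - 1
    i = min(max(i, p), k - 1)
    # De Boor recursion on the p + 1 active control points.
    d = [c[j + i - p].copy() for j in range(p + 1)]
    for r in range(1, p + 1):
        for j in range(p, r - 1, -1):
            denom = knots[j + 1 + i - r] - knots[j + i - p]
            alpha = 0.0 if denom == 0.0 else (t - knots[j + i - p]) / denom
            d[j] = (1.0 - alpha) * d[j - 1] + alpha * d[j]
    return d[p]
```

Because the spline is clamped, the curve interpolates the first and last control points, so `bspline_point(pts, 0.0)` returns `pts[0]` and `bspline_point(pts, 1.0)` returns `pts[-1]`; intermediate query times yield smooth 3D positions, matching the "3D position at arbitrary query time instants" behavior described in the abstract.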