We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications; it is also sensitive to the choice of random seed, leading to unstable convergence. To improve efficiency and robustness, we introduce CaDeX++, a novel invertible deformation network that factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of its coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias in its architectural design, it also exploits the inductive bias provided by vision foundation models. Our system uses monocular depth estimation to represent scene geometry and augments the objective with DINOv2 long-term semantics to regularize the optimization. Our experiments demonstrate substantial improvements in training speed (more than \textbf{10 times} faster), robustness, and tracking accuracy over the SoTA optimization-based method OmniMotion.