We tackle the long-standing challenge of reconstructing 3D structure and camera positions from videos. The problem is particularly hard when objects deform non-rigidly. Current approaches either make unrealistic assumptions or require long optimization times. We present TracksTo4D, a novel deep learning-based approach that infers 3D structure and camera positions from dynamic content in in-the-wild videos using a single feed-forward pass over a sparse point track matrix. To achieve this, we leverage recent advances in 2D point tracking and design an equivariant neural architecture that processes 2D point tracks directly by exploiting their symmetries. TracksTo4D is trained on a dataset of in-the-wild videos using only the 2D point tracks extracted from those videos, without any 3D supervision. Our experiments demonstrate that TracksTo4D generalizes well at inference time to unseen videos of unseen semantic categories, producing results comparable to state-of-the-art methods while significantly reducing runtime relative to other baselines.
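To make the notion of an equivariant layer over point tracks concrete, the following is a minimal sketch and not the TracksTo4D architecture itself. It assumes, purely for illustration, that one relevant symmetry is the arbitrary ordering of tracked points: reordering the points in the input should reorder the output in the same way. The function name `point_equivariant_layer` and the weight matrices `w_self` and `w_pool` are hypothetical, and the tracks are random numbers rather than real tracker output.

```python
import math
import random


def point_equivariant_layer(tracks, w_self, w_pool):
    """Shared per-point linear map plus a pooled per-frame term.

    tracks[f][p] is the 2D location (x, y) of point p in frame f.
    w_self and w_pool are hypothetical (2 x d_out) weight matrices,
    stored as nested lists.
    """
    d_out = len(w_self[0])
    output = []
    for frame in tracks:
        # The mean over points does not depend on the order of the points.
        mean = [sum(pt[k] for pt in frame) / len(frame) for k in range(2)]
        pooled = [sum(mean[k] * w_pool[k][j] for k in range(2))
                  for j in range(d_out)]
        # Every point is transformed by the same weights, so permuting
        # the input points permutes the output rows identically.
        output.append([
            [sum(pt[k] * w_self[k][j] for k in range(2)) + pooled[j]
             for j in range(d_out)]
            for pt in frame
        ])
    return output


random.seed(0)
num_frames, num_points, d_out = 4, 6, 3

# Synthetic stand-in for a point track matrix (frames x points x 2).
tracks = [[(random.random(), random.random()) for _ in range(num_points)]
          for _ in range(num_frames)]
w_self = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(2)]
w_pool = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(2)]

# Reorder the points the same way in every frame.
perm = list(range(num_points))
random.shuffle(perm)
permuted = [[frame[p] for p in perm] for frame in tracks]

out = point_equivariant_layer(tracks, w_self, w_pool)
out_perm = point_equivariant_layer(permuted, w_self, w_pool)

# Equivariance check: layer(permute(x)) == permute(layer(x)).
is_equivariant = all(
    math.isclose(out[f][perm[i]][j], out_perm[f][i][j], abs_tol=1e-9)
    for f in range(num_frames)
    for i in range(num_points)
    for j in range(d_out)
)
print(is_equivariant)
```

Sharing weights across points and adding an order-independent pooled term is one common way to build such a layer. The actual symmetries and layer design used by TracksTo4D may differ from this sketch.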