Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. State-of-the-art methods use strong priors and test-time optimization techniques, but require on the order of tens of seconds for large-scale point clouds, making them unusable as computer vision primitives for real-time applications such as open-world object detection. Feedforward methods are considerably faster, running on the order of tens to hundreds of milliseconds for large-scale point clouds, but require expensive human supervision. To address both limitations, we propose Scene Flow via Distillation, a simple distillation framework that uses a label-free optimization method to produce pseudo-labels to supervise a feedforward model. Our instantiation of this framework, ZeroFlow, produces scene flow estimates in real time on large-scale point clouds at quality competitive with state-of-the-art methods while using zero human labels. Notably, at test time, ZeroFlow is over 1000$\times$ faster than label-free state-of-the-art optimization-based methods on large-scale point clouds, and training it on unlabeled data is over 1000$\times$ cheaper than human annotation of that data. To facilitate research reuse, we release our code, trained model weights, and high-quality pseudo-labels for the Argoverse 2 and Waymo Open datasets.
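The data flow of such a distillation framework can be illustrated with a toy sketch. This is not the ZeroFlow implementation: the teacher below is a stand-in label-free optimizer (gradient descent on nearest-neighbor distance), the student is a stand-in linear model fit by least squares, and all names (`teacher_flow`, `features`, `distill`, `student_flow`) and the centroid-shift feature are hypothetical choices made only for illustration.

```python
import numpy as np

def teacher_flow(pc0, pc1, steps=100, lr=0.1):
    """Toy label-free teacher (an assumption, not the paper's optimizer):
    per-point flow fit by gradient descent on the squared distance from
    each warped point to its nearest neighbor in the next point cloud."""
    flow = np.zeros_like(pc0)
    for _ in range(steps):
        warped = pc0 + flow
        # Pairwise squared distances between warped points and pc1: (N, M).
        d2 = ((warped[:, None, :] - pc1[None, :, :]) ** 2).sum(axis=-1)
        nearest = pc1[d2.argmin(axis=1)]
        # Gradient of ||warped - nearest||^2 with respect to flow.
        flow -= lr * 2.0 * (warped - nearest)
    return flow

def features(pc0, pc1):
    """Hypothetical hand-made input for the toy student: the centroid
    shift between the two clouds, repeated per point, plus a bias term."""
    shift = pc1.mean(axis=0) - pc0.mean(axis=0)
    n = len(pc0)
    return np.hstack([np.tile(shift, (n, 1)), np.ones((n, 1))])

def distill(unlabeled_pairs):
    """Run the slow teacher once per unlabeled pair to get pseudo-labels,
    then fit the fast student to them. No human labels are involved."""
    X = np.vstack([features(a, b) for a, b in unlabeled_pairs])
    Y = np.vstack([teacher_flow(a, b) for a, b in unlabeled_pairs])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def student_flow(W, pc0, pc1):
    """Feedforward inference: a single matrix product, no optimization."""
    return features(pc0, pc1) @ W

# Synthetic unlabeled data: a unit-spaced grid moved by small translations,
# small enough that nearest-neighbor matching is unambiguous.
rng = np.random.default_rng(0)
axes = np.meshgrid(np.arange(4), np.arange(4), np.arange(2), indexing="ij")
grid = np.stack(axes, axis=-1).reshape(-1, 3).astype(float)
shifts = rng.uniform(-0.2, 0.2, size=(8, 3))
pairs = [(grid, grid + t) for t in shifts]

W = distill(pairs)

# The student is evaluated on a motion that was not in the training pairs.
held_out = np.array([0.1, -0.05, 0.15])
pred = student_flow(W, grid, grid + held_out)
print(np.abs(pred - held_out).max())
```

The expensive optimization runs only while building the pseudo-labeled training set; at inference the student needs just one forward pass, which is the source of the speed gap described above. In a realistic setting the student would be a neural network and the teacher a much stronger optimizer, neither of which this sketch attempts to reproduce.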