Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. State-of-the-art methods use strong priors and test-time optimization techniques, but require on the order of tens of seconds for large-scale point clouds, making them unusable as computer vision primitives for real-time applications such as open-world object detection. Feed-forward methods are considerably faster, running on the order of tens to hundreds of milliseconds for large-scale point clouds, but require expensive human supervision. To address both limitations, we propose Scene Flow via Distillation, a simple distillation framework that uses a label-free optimization method to produce pseudo-labels to supervise a feed-forward model. Our instantiation of this framework, ZeroFlow, produces scene flow estimates in real time on large-scale point clouds at quality competitive with state-of-the-art methods while using zero human labels. Notably, at test time ZeroFlow is over 1000$\times$ faster than label-free state-of-the-art optimization-based methods on large-scale point clouds, and training it on unlabeled data is over 1000$\times$ cheaper than human annotation of that data. To facilitate research reuse, we release our code, trained model weights, and high-quality pseudo-labels for the Argoverse 2 and Waymo Open datasets.
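The distillation idea described above (a slow label-free optimizer produces pseudo-labels, and a fast feed-forward model is trained on them) can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and not the released ZeroFlow code: the synthetic scenes, the names `make_scene`, `teacher_flow`, `features`, and `student_flow`, the nearest-neighbor translation fit standing in for the label-free optimizer, and the linear least-squares model standing in for the feed-forward network.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_scene(n=32):
    """Synthetic pair of point clouds: p1 is p0 moved by an unknown translation."""
    p0 = rng.uniform(0.0, 10.0, size=(n, 3))
    t = rng.uniform(-0.2, 0.2, size=3)
    p1 = rng.permutation(p0 + t)  # shuffle rows so no correspondences are given
    return p0, p1, t

def teacher_flow(p0, p1, steps=10):
    """Toy label-free 'teacher': iteratively fit a single translation by
    matching every moved point to its nearest neighbor in p1.
    Hypothetical stand-in for a test-time optimization method."""
    t = np.zeros(3)
    for _ in range(steps):
        moved = p0 + t
        dists = np.linalg.norm(moved[:, None, :] - p1[None, :, :], axis=-1)
        nearest = p1[dists.argmin(axis=1)]
        t = t + (nearest - moved).mean(axis=0)
    return np.tile(t, (len(p0), 1))  # per-point flow, shape (N, 3)

def features(p0, p1):
    """Scene-level input to the toy student: difference of centroids."""
    return p1.mean(axis=0) - p0.mean(axis=0)

# 1) Run the slow teacher on unlabeled scenes to obtain pseudo-labels.
train = [make_scene() for _ in range(40)]
X = np.stack([features(p0, p1) for p0, p1, _ in train])
Y = np.stack([teacher_flow(p0, p1)[0] for p0, p1, _ in train])

# 2) Fit the student on pseudo-labels only (ground truth is never used here).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def student_flow(p0, p1):
    """Toy feed-forward 'student': one matrix multiply at inference time."""
    return np.tile(features(p0, p1) @ W, (len(p0), 1))

# 3) Evaluate the student on a held-out scene; t_true is used only for scoring.
p0, p1, t_true = make_scene()
epe = np.linalg.norm(student_flow(p0, p1) - t_true, axis=1).mean()
print(f"student mean end-point error: {epe:.4f}")
```

In this sketch the teacher needs an iterative nearest-neighbor search per scene, while the student answers with a single forward computation, which mirrors the speed trade-off the abstract describes, only at toy scale.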