Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. State-of-the-art methods use strong priors and test-time optimization techniques, but require on the order of tens of seconds to process full-size point clouds, making them unusable as computer vision primitives for real-time applications such as open-world object detection. Feedforward methods are considerably faster, running on the order of tens to hundreds of milliseconds for full-size point clouds, but require expensive human supervision. To address both limitations, we propose Scene Flow via Distillation, a simple, scalable distillation framework that uses a label-free optimization method to produce pseudo-labels to supervise a feedforward model. Our instantiation of this framework, ZeroFlow, achieves state-of-the-art performance on the Argoverse 2 Self-Supervised Scene Flow Challenge while using zero human labels by simply training on large-scale, diverse unlabeled data. At test time, ZeroFlow is over 1000x faster than label-free state-of-the-art optimization-based methods on full-size point clouds (34 FPS vs. 0.028 FPS) and over 1000x cheaper to train on unlabeled data compared to the cost of human annotation (\$394 vs. ~\$750,000). To facilitate further research, we release our code, trained model weights, and high-quality pseudo-labels for the Argoverse 2 and Waymo Open datasets at https://vedder.io/zeroflow.html
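The two "over 1000x" figures can be checked directly from the numbers quoted above. The short Python sketch below is only an arithmetic illustration: the variable names are ours, the human annotation cost is the approximate value given in the text, and the per-frame latencies are simply the reciprocals of the reported frame rates.

```python
# Figures quoted in the abstract; variable names are illustrative only.
ZEROFLOW_FPS = 34.0               # ZeroFlow throughput on full-size point clouds
OPTIMIZATION_FPS = 0.028          # label-free optimization-based method
ZEROFLOW_COST_USD = 394.0         # reported cost of training on unlabeled data
HUMAN_LABEL_COST_USD = 750_000.0  # approximate human annotation cost ("~" in the text)

# Ratios behind the two "over 1000x" claims.
speedup = ZEROFLOW_FPS / OPTIMIZATION_FPS
cost_ratio = HUMAN_LABEL_COST_USD / ZEROFLOW_COST_USD

# Per-frame latency is the reciprocal of the frame rate.
zeroflow_latency_ms = 1000.0 / ZEROFLOW_FPS
optimization_latency_s = 1.0 / OPTIMIZATION_FPS

print(f"speedup: {speedup:.0f}x")
print(f"cost ratio: {cost_ratio:.0f}x")
print(f"ZeroFlow latency: {zeroflow_latency_ms:.1f} ms per frame")
print(f"optimization latency: {optimization_latency_s:.1f} s per frame")
```

Under these figures the speedup is roughly 1200x and the cost ratio roughly 1900x. The implied latencies, about 29 ms versus about 36 s per frame, match the "tens to hundreds of milliseconds" and "tens of seconds" ranges stated for feedforward and optimization-based methods.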