Point clouds are naturally sparse, while image pixels are dense. This inconsistency limits the fusion of features from the two modalities for point-wise scene flow estimation. Previous methods rarely predict scene flow for the entire point cloud of a scene in a single inference pass, because the farthest point sampling, KNN, and ball query algorithms commonly used for local feature aggregation are memory-inefficient and incur heavy overhead from distance calculation and sorting. To mitigate these issues in scene flow learning, we regularize raw points into a dense format by storing 3D coordinates in 2D grids. Unlike the sampling operation commonly used in existing works, the dense 2D representation 1) preserves most points in the given scene, 2) brings a significant boost in efficiency, and 3) eliminates the density gap between points and pixels, allowing us to perform effective feature fusion. We also present a novel warping projection technique to alleviate the information loss that arises when computing the cost volume, where multiple points may be mapped into the same grid cell during projection. Extensive experiments demonstrate the efficiency and effectiveness of our method, which outperforms prior art on the FlyingThings3D and KITTI datasets.
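The idea of storing 3D coordinates in a 2D grid can be illustrated with a minimal sketch. The abstract does not specify the projection, so the code below assumes a pinhole camera model; the function name `project_to_grid`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the rule of keeping the nearest point when several points share a cell are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def project_to_grid(points, fx, fy, cx, cy, height, width):
    """Store 3D points of shape (N, 3) in an (H, W, 3) grid.

    Assumes a pinhole projection (hypothetical intrinsics fx, fy, cx, cy).
    Returns the grid and a boolean mask of occupied cells. When several
    points fall into one cell, only the nearest (smallest z) is kept;
    this tie-break is an assumption of the sketch.
    """
    points = np.asarray(points, dtype=np.float64)
    grid = np.zeros((height, width, 3), dtype=np.float64)
    mask = np.zeros((height, width), dtype=bool)

    # Discard points behind the camera before dividing by depth.
    pts = points[points[:, 2] > 0]

    # Pinhole projection to integer cell indices (u: column, v: row).
    u = np.floor(fx * pts[:, 0] / pts[:, 2] + cx).astype(np.int64)
    v = np.floor(fy * pts[:, 1] / pts[:, 2] + cy).astype(np.int64)

    # Keep only points that land inside the grid.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    pts, u, v = pts[inside], u[inside], v[inside]

    # Sort by cell index, then by depth, and keep the first point per cell.
    flat = v * width + u
    order = np.lexsort((pts[:, 2], flat))
    flat_sorted = flat[order]
    first = np.ones(flat_sorted.shape[0], dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    keep = order[first]

    grid[v[keep], u[keep]] = pts[keep]
    mask[v[keep], u[keep]] = True
    return grid, mask
```

Every point that survives projection keeps its full 3D coordinates, so neighbourhoods can be read from adjacent cells rather than found by distance search. The points dropped in the collision step illustrate the kind of information loss the abstract refers to when multiple points map to one cell.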