Scene flow estimation aims to recover per-point motion from two adjacent LiDAR scans. However, in real-world applications such as autonomous driving, points rarely move independently; in particular, nearby points belonging to the same object often share the same motion. Incorporating this locally rigid motion constraint has been a key challenge in self-supervised scene flow estimation, typically addressed by post-processing or extra regularization terms. While these approaches improve the rigidity of predicted flows, they lack an architectural inductive bias for local rigidity within the model structure, leading to suboptimal learning efficiency and inferior performance. In contrast, we enforce local rigidity with a lightweight add-on module in the network design, enabling end-to-end learning. We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points via differentiable voting. Additionally, to ensure computational efficiency, we operate on pillars rather than points and learn a representative feature per pillar for voting. We plug the Voting Module into popular model designs and evaluate its benefit on the Argoverse 2 and Waymo datasets. We outperform baseline works with only marginal compute overhead. Code is available at https://github.com/tudelft-iv/VoteFlow.
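The core idea of differentiable voting over a discretized translation space can be sketched as follows. This is an illustrative toy example, not the paper's implementation: the candidate grid resolution, the `differentiable_vote` helper, and the use of a soft-argmax over similarity scores are all assumptions made for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical discretized voting space: candidate 2D translations on a
# 5x5 grid spanning [-1, 1] m in x and y (25 bins in total).
candidates = np.stack(
    np.meshgrid(np.linspace(-1.0, 1.0, 5),
                np.linspace(-1.0, 1.0, 5), indexing="ij"),
    axis=-1,
).reshape(-1, 2)  # (25, 2)

def differentiable_vote(scores):
    # scores: (num_pillars, num_candidates) similarity of each pillar's
    # feature with the target features at each candidate translation.
    # A soft-argmax over the voting space gives a translation estimate
    # that stays differentiable with respect to the scores.
    weights = softmax(scores, axis=-1)   # (P, C), sums to 1 per pillar
    return weights @ candidates          # (P, 2) expected translation

# Toy example: two nearby pillars whose scores peak at the same bin,
# mimicking a locally rigid translation shared by points on one object.
scores = np.full((2, 25), -5.0)
scores[:, 12] = 5.0                      # bin 12 is translation (0, 0)
flow = differentiable_vote(scores)       # both rows close to (0, 0)
```

Because the vote is a weighted average rather than a hard argmax, gradients flow back through the scores, which is what allows the module to be trained end to end.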