In this paper, we study the problem of jointly estimating optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an ``early-fusion'' or ``late-fusion'' manner. Such one-size-fits-all approaches face a dilemma: they either fail to fully utilize the characteristics of each modality or fail to maximize the inter-modality complementarity. To address this problem, we propose a novel end-to-end framework that consists of 2D and 3D branches with multiple bidirectional fusion connections between them at specific layers. Unlike previous work, we apply a point-based 3D branch to extract the LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of the bidirectional fusion pipeline, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other based on recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9\% reduction in 3D end-point error relative to the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26\% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with far fewer parameters. In addition, our methods exhibit strong generalization performance and can handle non-rigid motion. Code is available at https://github.com/MCG-NJU/CamLiFlow.
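To make the idea of bidirectional fusion between dense image features and sparse point features concrete, the following is a minimal, non-learnable sketch of the data flow in both directions. It is not the Bi-CLFM operator itself: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the nearest-pixel lookup, the per-pixel mean scatter, the fusion by channel concatenation, and all function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

Feature = List[float]

def project(points: List[Tuple[float, float, float]],
            fx: float, fy: float, cx: float, cy: float) -> List[Tuple[float, float]]:
    """Pinhole projection of camera-frame points (x, y, z) to pixel coordinates (u, v)."""
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points]

def fuse_2d_to_3d(image_feat: List[List[Feature]],
                  point_feat: List[Feature],
                  uv: List[Tuple[float, float]]) -> List[Feature]:
    """2D -> 3D: concatenate each point feature with the image feature at its nearest pixel.

    Projections outside the image are clamped to the border (an assumption)."""
    h, w = len(image_feat), len(image_feat[0])
    fused = []
    for (u, v), feat in zip(uv, point_feat):
        col = min(max(int(round(u)), 0), w - 1)
        row = min(max(int(round(v)), 0), h - 1)
        fused.append(feat + image_feat[row][col])
    return fused

def fuse_3d_to_2d(image_feat: List[List[Feature]],
                  point_feat: List[Feature],
                  uv: List[Tuple[float, float]]) -> List[List[Feature]]:
    """3D -> 2D: average the point features landing on each pixel, then concatenate.

    Pixels that receive no point get zeros; points projecting outside the image are dropped."""
    h, w = len(image_feat), len(image_feat[0])
    c = len(point_feat[0])
    sums = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    counts = [[0] * w for _ in range(h)]
    for (u, v), feat in zip(uv, point_feat):
        col, row = int(round(u)), int(round(v))
        if 0 <= row < h and 0 <= col < w:
            counts[row][col] += 1
            sums[row][col] = [s + x for s, x in zip(sums[row][col], feat)]
    return [[image_feat[r][k] + [s / max(counts[r][k], 1) for s in sums[r][k]]
             for k in range(w)] for r in range(h)]

# Toy input: a 2x2 image with one feature channel and two 3D points.
image_feat = [[[1.0], [2.0]], [[3.0], [4.0]]]
points = [(0.0, 0.0, 1.0), (2.0, 2.0, 2.0)]
point_feat = [[10.0], [20.0]]
uv = project(points, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
points_fused = fuse_2d_to_3d(image_feat, point_feat, uv)
image_fused = fuse_3d_to_2d(image_feat, point_feat, uv)
```

In this toy setup most pixels receive no point at all, which illustrates why fusing sparse point features into a dense feature map is harder than the reverse direction; a learnable operator would replace the fixed lookup and averaging used here.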