Neural implicit fields have become a de facto standard in novel view synthesis. Recently, several methods have explored fusing multiple modalities within a single field, sharing implicit features across modalities to improve reconstruction performance. However, these modalities often exhibit misaligned behaviors: optimizing for one modality, such as LiDAR, can degrade the other, such as camera, and vice versa. In this work, we conduct comprehensive analyses of the multimodal implicit field for LiDAR-camera joint synthesis, revealing that the underlying issue lies in the misalignment of different sensors. Furthermore, we introduce AlignMiF, a geometrically aligned multimodal implicit field with two proposed modules: Geometry-Aware Alignment (GAA) and Shared Geometry Initialization (SGI). These modules effectively align the coarse geometry across different modalities, significantly enhancing the fusion of LiDAR and camera data. Through extensive experiments across various datasets and scenes, we demonstrate that our approach facilitates better interaction between the LiDAR and camera modalities within a unified neural field. Specifically, AlignMiF achieves remarkable improvement over recent implicit fusion methods (+2.01 and +3.11 image PSNR on the KITTI-360 and Waymo datasets, respectively) and consistently surpasses single-modality performance (13.8% and 14.2% reduction in LiDAR Chamfer Distance on the respective datasets).
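For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how image PSNR and a point-cloud Chamfer Distance are commonly computed. It is not the evaluation code behind the reported numbers: the function names `psnr` and `chamfer_distance` are hypothetical, and the Chamfer convention chosen here (unsquared Euclidean distance, averaged per point, summed over both directions) is an assumption, since conventions vary between papers.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    Assumes pixel intensities lie in [0, max_value].
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two point sets (brute force, O(N*M)).

    Assumed convention: mean nearest-neighbour Euclidean distance from A to B
    plus the same quantity from B to A.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # For each source point, distance to its nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

Higher PSNR indicates a closer match to the reference image, while lower Chamfer Distance indicates a closer match to the reference point cloud, which is why the abstract reports the former as a gain and the latter as a percentage reduction.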