Multi-modal fusion is increasingly used for autonomous driving tasks, because data from different modalities provide unique information for feature extraction. However, existing two-stream networks fuse the modalities only at a specific network layer, and choosing that layer requires extensive manual trial and error. As the CNN goes deeper, the features of the two modalities become increasingly high-level and abstract, so fusion takes place between features separated by a large gap, which can easily hurt performance. In this study, we propose a novel fusion architecture called skip-cross networks (SkipcrossNets), which adaptively combines LiDAR point clouds and camera images without being bound to a particular fusion stage. Specifically, skip-cross connects each layer to each layer in a feed-forward manner: each layer takes the feature maps of all previous layers as input, and its own feature maps are used as input to all subsequent layers of the other modality, which enhances feature propagation and multi-modal feature fusion. This strategy facilitates selection of the most similar feature layers from the two data pipelines, providing a complementary effect for sparse point cloud features during fusion. The network is also divided into several blocks to reduce the complexity of feature fusion and the number of model parameters. The advantages of skip-cross fusion were demonstrated on the KITTI and A2D2 datasets, where the network achieved a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters require only 2.33 MB of memory and the network runs at 68.24 FPS, which could make it viable for mobile terminals and embedded devices.
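The skip-cross connectivity described above can be illustrated with a small bookkeeping sketch. It is not the authors' implementation. It assumes one possible reading of the abstract, in which layer i of a stream consumes the output of its own preceding layer plus the outputs of all earlier layers of the other stream within the same block. The names `skip_cross_inputs` and `count_cross_links`, the modality labels, and the layer counts are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# A node is (modality, layer index); index 0 denotes the block input.
Node = Tuple[str, int]


def skip_cross_inputs(
    num_layers: int,
    modalities: Tuple[str, str] = ("camera", "lidar"),
) -> Dict[Node, List[Node]]:
    """Map each layer of one block to the feature maps it consumes.

    Assumed reading (not confirmed by the abstract): layer i takes the
    output of layer i - 1 from its own stream and the outputs of layers
    0 .. i - 1 from the other stream.
    """
    first, second = modalities
    inputs: Dict[Node, List[Node]] = {}
    for i in range(1, num_layers + 1):
        for own, other in ((first, second), (second, first)):
            sources: List[Node] = [(own, i - 1)]
            sources += [(other, j) for j in range(i)]
            inputs[(own, i)] = sources
    return inputs


def count_cross_links(inputs: Dict[Node, List[Node]]) -> int:
    """Count connections whose source and destination modalities differ."""
    return sum(
        1
        for dst, srcs in inputs.items()
        for src in srcs
        if src[0] != dst[0]
    )


if __name__ == "__main__":
    one_block = skip_cross_inputs(12)
    small_block = skip_cross_inputs(3)
    print(small_block[("lidar", 2)])
    print(count_cross_links(one_block), 4 * count_cross_links(small_block))
```

Under this assumed pattern, the number of cross-modal links in a block of L layers per stream grows as L(L + 1). Splitting a hypothetical 12-layer stream into four 3-layer blocks therefore lowers the count from 156 to 48, which is in line with the stated motivation for dividing the network into blocks.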