Research on camera- and LiDAR-based semantic object segmentation for autonomous driving has benefited significantly from recent developments in deep learning. In particular, the vision transformer successfully brought the multi-head attention mechanism to computer vision applications. We therefore propose a vision-transformer-based network, the camera-LiDAR fusion transformer (CLFT), that performs camera-LiDAR fusion for semantic segmentation in autonomous driving. Our approach applies the progressive-assemble strategy of vision transformers in a double-direction network and then integrates the results with a cross-fusion strategy over the transformer decoder layers. Unlike other works in the literature, our camera-LiDAR fusion transformers are evaluated in challenging conditions such as rain and low illumination, where they show robust performance. We report segmentation results for the vehicle and human classes in three modalities: camera-only, LiDAR-only, and camera-LiDAR fusion. We perform controlled benchmark experiments comparing CLFT against other networks designed for semantic segmentation. The experiments evaluate CLFT from two independent perspectives: multimodal sensor fusion and backbone architecture. Quantitatively, our CLFT networks yield an improvement of up to 10% in challenging dark-wet conditions compared with a fully convolutional neural network (FCN)-based camera-LiDAR fusion network. Compared with a network that has a transformer backbone but uses single-modality input, the all-around improvement is 5-10%.
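The cross-fusion idea mentioned in the abstract, in which camera and LiDAR features are combined across decoder stages, can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, the number of decoder stages, or the feature shapes, so the element-wise sum, the nearest-neighbour upsampling, and the names `upsample2x` and `cross_fuse` below are all assumptions made for illustration and are not the CLFT implementation.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def cross_fuse(cam_feats, lidar_feats):
    """Fuse per-stage camera and LiDAR feature maps, coarse to fine.

    cam_feats, lidar_feats: lists of (C, H, W) arrays ordered coarsest
    first, with H and W doubling at each stage (an assumed layout).
    The element-wise sum is an assumed stand-in for whatever learned
    fusion block a real network would use.
    """
    if len(cam_feats) != len(lidar_feats):
        raise ValueError("both modalities need the same number of stages")

    fused = None
    for cam, lidar in zip(cam_feats, lidar_feats):
        if cam.shape != lidar.shape:
            raise ValueError("per-stage feature shapes must match")
        # Combine the two modalities at this decoder stage.
        stage = cam + lidar
        # Carry the fused result of the previous (coarser) stage forward.
        if fused is not None:
            stage = stage + upsample2x(fused)
        fused = stage
    return fused


# Toy usage: three stages with 4 channels, spatial size 2 -> 4 -> 8.
cam = [np.ones((4, s, s)) for s in (2, 4, 8)]
lidar = [2.0 * np.ones((4, s, s)) for s in (2, 4, 8)]
out = cross_fuse(cam, lidar)
print(out.shape)
```

In this toy setting, a single-modality baseline corresponds to passing zeros for the other modality, which mirrors the camera-only and LiDAR-only comparisons described in the abstract only in spirit.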