Sensor fusion is critical to perception systems in domains such as autonomous driving and robotics. Recently, Transformers integrated with CNNs have demonstrated high performance in sensor fusion for various perception tasks. In this work, we introduce a method for fusing camera and LiDAR data. By employing Transformer modules at multiple resolutions, the proposed method effectively combines local and global contextual relationships. Its performance is validated through extensive experiments on two adversarial benchmarks with lengthy routes and high-density traffic. The proposed method outperforms previous approaches on the most challenging benchmarks, achieving significantly higher driving and infraction scores. Compared with TransFuser, it improves driving scores by 8% and 19% on the Longest6 and Town05 Long benchmarks, respectively.
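The idea of applying Transformer modules at multiple resolutions can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NumPy, single-head self-attention with random (untrained) weights, 2x2 average pooling as a stand-in for a CNN stage, and camera and LiDAR feature maps of equal size. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over (n_tokens, dim).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def avg_pool2(x):
    # 2x2 average pooling on an (H, W, C) map; stands in for a CNN stage.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def fuse_at_scale(cam_feat, lidar_feat, rng):
    # Flatten both maps into tokens and concatenate them, so that
    # attention spans positions from both sensors (global context).
    h, w, c = cam_feat.shape
    tokens = np.concatenate(
        [cam_feat.reshape(-1, c), lidar_feat.reshape(-1, c)], axis=0
    )
    # Random projection weights: an assumption for illustration only.
    w_q, w_k, w_v = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
    fused = tokens + self_attention(tokens, w_q, w_k, w_v)  # residual connection
    # Split the fused tokens back into per-sensor feature maps.
    cam_out = fused[: h * w].reshape(h, w, c)
    lidar_out = fused[h * w :].reshape(h, w, c)
    return cam_out, lidar_out

def multi_resolution_fusion(cam, lidar, n_scales=3, seed=0):
    # Alternate a downsampling stage (local context) with attention-based
    # fusion (global context) at each successively coarser resolution.
    rng = np.random.default_rng(seed)
    shapes = []
    for _ in range(n_scales):
        cam, lidar = avg_pool2(cam), avg_pool2(lidar)
        cam, lidar = fuse_at_scale(cam, lidar, rng)
        shapes.append(cam.shape[:2])
    return cam, lidar, shapes

cam_out, lidar_out, shapes = multi_resolution_fusion(
    np.ones((16, 16, 8)), np.zeros((16, 16, 8))
)
print(shapes)
```

In this sketch the LiDAR branch starts as all zeros and ends up non-zero, because attention mixes camera tokens into it at every scale. A trained model would learn the projection weights and would typically use multi-head attention with positional embeddings.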