In self-driving applications, LiDAR data provides accurate 3D distance information but lacks the semantic richness of camera data. Therefore, state-of-the-art methods for perception in urban scenes fuse data from both sensor types. In this work, we introduce a novel self-supervised method to fuse LiDAR and camera data. We build upon masked autoencoders (MAEs) and train deep learning models to reconstruct masked LiDAR data from fused LiDAR and camera features. In contrast to related methods that use bird's-eye-view representations, we fuse features from dense spherical LiDAR projections with features from fish-eye camera crops that have a similar field of view. As a result, the spatial transformations to be learned reduce to moderate perspective transformations, and no additional modules are required to generate dense LiDAR representations. Code is available at: https://github.com/KIT-MRT/masked-fusion-360
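To make the two input-side ideas concrete, the following is a minimal NumPy sketch of a spherical LiDAR projection (point cloud to dense range image) and MAE-style random patch masking of that image. It is an illustration under stated assumptions, not the implementation from the linked repository: the function names `spherical_projection` and `random_patch_mask`, the image size (64 x 1024), the vertical field of view (+3° to -25°), the patch size (8), and the mask ratio (0.75) are all hypothetical choices that do not come from the abstract.

```python
import numpy as np


def spherical_projection(points, height=64, width=1024,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (N, 3) LiDAR points into a (height, width) range image.

    All default values are illustrative assumptions, not taken from the paper.
    Empty pixels are left at 0.
    """
    depth = np.linalg.norm(points, axis=1)
    keep = depth > 0                      # drop points at the sensor origin
    x, y, z = points[keep, 0], points[keep, 1], points[keep, 2]
    depth = depth[keep]

    yaw = np.arctan2(y, x)                # azimuth in [-pi, pi]
    pitch = np.arcsin(z / depth)          # elevation angle

    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(fov_down_deg)

    # Map angles to pixel coordinates; points outside the vertical
    # field of view are clipped to the first or last row.
    u = 0.5 * (1.0 - yaw / np.pi) * width
    v = (fov_up - pitch) / (fov_up - fov_down) * height
    u = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, height - 1).astype(np.int64)

    # Write far points first: with repeated indices NumPy keeps the last
    # assigned value, so the nearest point wins in each pixel.
    order = np.argsort(-depth)
    image = np.zeros((height, width), dtype=np.float32)
    image[v[order], u[order]] = depth[order]
    return image


def random_patch_mask(height, width, patch=8, mask_ratio=0.75, seed=0):
    """Return a boolean (height, width) mask; True marks masked pixels.

    Patches are masked uniformly at random, as in a standard MAE setup.
    Assumes height and width are divisible by the patch size.
    """
    gen = np.random.default_rng(seed)
    grid_h, grid_w = height // patch, width // patch
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))

    flat = np.zeros(n_patches, dtype=bool)
    flat[gen.choice(n_patches, size=n_masked, replace=False)] = True
    grid = flat.reshape(grid_h, grid_w)

    # Expand the patch grid to pixel resolution.
    return np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)


# Usage with synthetic points (random data, for shape checking only).
demo_points = np.random.default_rng(1).normal(size=(2000, 3)) * 20.0
demo_image = spherical_projection(demo_points)
demo_mask = random_patch_mask(*demo_image.shape)
visible_input = np.where(demo_mask, 0.0, demo_image)  # encoder sees only unmasked pixels
```

In a setup like the one described, a model would receive `visible_input` together with camera features and be trained to reconstruct the range values under `demo_mask`. The fusion and reconstruction networks themselves are not sketched here.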