The fusion of LiDARs and cameras is increasingly adopted in autonomous driving for perception tasks. The performance of such fusion-based algorithms depends largely on the accuracy of sensor calibration, which is challenging because common features are difficult to identify across data modalities. Many earlier calibration methods relied on specific targets and/or manual intervention, which proved cumbersome and costly. Learning-based online calibration methods have been proposed, but their performance is barely satisfactory in most cases. These methods usually suffer from issues such as sparse feature maps, unreliable cross-modality association, and inaccurate regression of calibration parameters. To address these issues, we propose CalibFormer, an end-to-end network for automatic LiDAR-camera calibration. We aggregate multiple layers of camera and LiDAR image features to achieve high-resolution representations. A multi-head correlation module is used to identify correlations between features more accurately. Lastly, we employ transformer architectures to estimate accurate calibration parameters from the correlation information. Our method achieves a mean translation error of $0.8751 \mathrm{cm}$ and a mean rotation error of $0.0562^{\circ}$ on the KITTI dataset, surpassing existing state-of-the-art methods and demonstrating strong robustness, accuracy, and generalization capabilities.
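The reported translation and rotation errors compare estimated extrinsics against ground truth. The abstract does not state how these errors are computed or aggregated, so the following is only a minimal sketch under assumptions: translation error is taken as the Euclidean distance between translation vectors (converted from metres to centimetres), and rotation error as the geodesic angle between rotation matrices. The function names and the `rot_z` helper are hypothetical and are not taken from the paper.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (assumed metric)."""
    R_delta = R_pred.T @ R_gt
    # For a rotation by angle theta, trace(R) = 1 + 2*cos(theta).
    cos_theta = (np.trace(R_delta) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def translation_error_cm(t_pred_m, t_gt_m):
    """Euclidean distance between two translations given in metres, returned in cm."""
    diff = np.asarray(t_pred_m, dtype=float) - np.asarray(t_gt_m, dtype=float)
    return float(np.linalg.norm(diff) * 100.0)


def rot_z(deg):
    """Hypothetical helper: rotation about the z-axis by `deg` degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


if __name__ == "__main__":
    # Toy extrinsics, not data from the paper.
    R_gt, t_gt = rot_z(10.0), [0.10, -0.05, 0.20]
    R_pred, t_pred = rot_z(10.5), [0.103, -0.046, 0.20]
    print("rotation error (deg):", rotation_error_deg(R_pred, R_gt))
    print("translation error (cm):", translation_error_cm(t_pred, t_gt))
```

A mean error over a dataset would then be the average of these per-sample values. Some works instead report the mean absolute error per axis, so results computed this way are not directly comparable with published figures unless the metric definitions match.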