In this paper, we propose Cross Modal Transformer (CMT), a robust detector for end-to-end multi-modal 3D detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. Spatial alignment of the multi-modal tokens is achieved by encoding 3D points into the multi-modal features. The core design of CMT is simple, yet its performance is strong: it achieves 74.1\% NDS on the nuScenes test set (state of the art for a single model) while maintaining fast inference speed. Moreover, CMT remains robust even when the LiDAR input is missing. Code is released at https://github.com/junjie18/CMT.
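To make the token-level fusion described in the abstract concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes each image token and each point cloud token comes with a set of 3D points, that a linear map (a stand-in for a learned position encoder) turns those points into an embedding added to the token, and that object queries attend over the concatenated tokens in a single cross-attention step. All names (`fuse_and_decode`, `pe_im`, `pe_pc`, `pe_q`, `box_head`), the random weights, and the 7-value box parameterization are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 32  # token channel width (illustrative value)


def linear(in_dim, out_dim):
    """Random weight matrix standing in for a learned linear layer."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_and_decode(im_tok, im_pts, pc_tok, pc_pts, query_pts, params):
    """Hypothetical single-layer fusion decoder.

    im_tok: (N_im, C) image tokens; im_pts: (N_im, 3*D) 3D points per image token.
    pc_tok: (N_pc, C) point cloud tokens; pc_pts: (N_pc, 3*H) 3D points per token.
    query_pts: (N_q, 3) 3D anchor points used to initialise the object queries.
    """
    # Encode 3D points into each modality's tokens (assumed form of alignment).
    im = im_tok + im_pts @ params["pe_im"]
    pc = pc_tok + pc_pts @ params["pe_pc"]
    # No view transformation: tokens from both modalities are simply concatenated.
    memory = np.concatenate([im, pc], axis=0)
    # Queries are embedded from 3D anchors and cross-attend to all tokens.
    q = query_pts @ params["pe_q"]
    attn = softmax(q @ memory.T / np.sqrt(C))
    q = q + attn @ memory
    # Assumed box layout: centre (3), size (3), yaw (1).
    return q @ params["box_head"]


n_im, n_pc, n_q, D, H = 6, 4, 3, 5, 2
params = {
    "pe_im": linear(3 * D, C),
    "pe_pc": linear(3 * H, C),
    "pe_q": linear(3, C),
    "box_head": linear(C, 7),
}
im_tok = rng.standard_normal((n_im, C))
im_pts = rng.standard_normal((n_im, 3 * D))
pc_tok = rng.standard_normal((n_pc, C))
pc_pts = rng.standard_normal((n_pc, 3 * H))
query_pts = rng.standard_normal((n_q, 3))

boxes = fuse_and_decode(im_tok, im_pts, pc_tok, pc_pts, query_pts, params)
print(boxes.shape)  # (3, 7)

# Dropping LiDAR amounts to passing zero point cloud tokens; the interface is unchanged.
boxes_cam_only = fuse_and_decode(
    im_tok, im_pts, np.zeros((0, C)), np.zeros((0, 3 * H)), query_pts, params
)
print(boxes_cam_only.shape)  # (3, 7)
```

Because the decoder only sees a flat set of tokens, removing one modality changes the number of tokens but not the computation. This sketch shows only that structural property; it says nothing about the accuracy obtained in that setting.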