To achieve accurate 3D object detection at low cost for autonomous driving, many multi-camera methods have been proposed, solving the occlusion problem of monocular approaches. However, lacking accurate depth estimates, existing multi-camera methods often generate multiple bounding boxes along a ray in the depth direction for difficult small objects such as pedestrians, resulting in extremely low recall. Furthermore, directly adding depth prediction modules to existing multi-camera methods, which are generally built on large network architectures, cannot meet the real-time requirements of self-driving applications. To address these issues, we propose Cross-view and Depth-guided Transformers for 3D Object Detection (CrossDTR). First, a lightweight depth predictor produces precise object-wise sparse depth maps and low-dimensional depth embeddings without requiring extra depth datasets for supervision. Second, a cross-view depth-guided transformer fuses the depth embeddings with image features from cameras of different views and generates 3D bounding boxes. Extensive experiments demonstrate that our method surpasses existing multi-camera methods by 10 percent in pedestrian detection and by about 3 percent in overall mAP and NDS. In addition, computational analyses show that our method is 5 times faster than prior approaches. Our code will be made publicly available at https://github.com/sty61010/CrossDTR.
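The depth ambiguity behind the duplicate boxes can be illustrated with the standard pinhole camera model. The sketch below is not taken from the CrossDTR code; the function names and the intrinsics (`FX`, `FY`, `CX`, `CY`) are hypothetical values chosen for illustration. It shows that 3D points at different depths along one camera ray all project to the same pixel, so image evidence alone cannot tell them apart.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) to camera coordinates at the given depth (pinhole model)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def project(p: Point3D, fx: float, fy: float,
            cx: float, cy: float) -> Tuple[float, float]:
    """Project a camera-frame 3D point back to pixel coordinates."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def ray_candidates(u: float, v: float, depths: List[float],
                   fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """All 3D positions along the ray through (u, v), one per candidate depth."""
    return [backproject(u, v, d, fx, fy, cx, cy) for d in depths]


# Hypothetical intrinsics and pixel location, for illustration only.
FX, FY, CX, CY = 1000.0, 1000.0, 800.0, 450.0
candidates = ray_candidates(900.0, 500.0, [10.0, 20.0, 40.0], FX, FY, CX, CY)

# Every candidate reprojects to the same pixel: without a depth estimate,
# a detector has no geometric reason to prefer one over the others.
pixels = [project(p, FX, FY, CX, CY) for p in candidates]
```

Under these assumptions, a per-object depth estimate is what selects a single candidate along the ray, which is the role the abstract assigns to the depth predictor.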