3D object detection in Bird's-Eye-View (BEV) space has recently emerged as a prevalent approach in autonomous driving. Although BEV-based methods improve accuracy and velocity estimation over perspective-view methods, deploying them in real-world autonomous vehicles remains challenging. This is primarily because they rely on vision-transformer (ViT) based architectures, whose complexity is quadratic in the input resolution. To address this issue, we propose BEVENet, an efficient BEV-based 3D detection framework that uses a convolution-only architecture to circumvent the limitations of ViT models while maintaining the effectiveness of BEV-based methods. Our experiments show that BEVENet is 3$\times$ faster than contemporary state-of-the-art (SOTA) approaches on the nuScenes challenge, achieving a mean average precision (mAP) of 0.456 and a nuScenes detection score (NDS) of 0.555 on the nuScenes validation set, with an inference speed of 47.6 frames per second. To the best of our knowledge, this study is the first to achieve such significant efficiency improvements for BEV-based methods, highlighting their enhanced feasibility for real-world autonomous driving applications.
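The scaling argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not BEVENet's architecture or cost model: the functions `attention_cost` and `conv_cost`, the patch size, the channel width, and the two input resolutions are all hypothetical values chosen for illustration. The sketch only shows that a global self-attention layer scales with the square of the token count, while a convolution layer scales linearly with the pixel count.

```python
def attention_cost(height, width, patch=16, dim=256):
    """Rough multiply-add count for one global self-attention layer.

    patch and dim are illustrative assumptions, not values from the paper.
    """
    tokens = (height // patch) * (width // patch)
    # QK^T and the attention-weighted sum over V each cost about tokens^2 * dim.
    return 2 * tokens * tokens * dim


def conv_cost(height, width, kernel=3, channels=256):
    """Rough multiply-add count for one stride-1 convolution layer.

    kernel and channels are illustrative assumptions, not values from the paper.
    """
    # Each output pixel needs kernel^2 * channels^2 multiply-adds.
    return height * width * kernel * kernel * channels * channels


# Hypothetical input resolutions: the second doubles each side of the first.
base = (256, 704)
doubled = (512, 1408)

attn_ratio = attention_cost(*doubled) / attention_cost(*base)
conv_ratio = conv_cost(*doubled) / conv_cost(*base)

print(f"attention cost grows {attn_ratio:.0f}x, convolution cost grows {conv_ratio:.0f}x")
# prints "attention cost grows 16x, convolution cost grows 4x"
```

Doubling each side of the image quadruples the number of pixels, so under these assumptions the convolution cost grows 4x while the attention cost grows 16x. Real ViT variants often use windowed or sparse attention, which this sketch ignores.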