Comprehending the environment and accurately detecting objects in 3D space are essential for advancing autonomous vehicle technologies. Fusing camera and LiDAR data has emerged as an effective approach for achieving high accuracy in 3D object detection. However, existing methods often rely on heavy, traditional backbones that are computationally demanding. This paper introduces a novel approach that incorporates cutting-edge deep learning techniques into the feature extraction process, aiming to create more efficient models without compromising performance. Our model, NextBEV, surpasses established feature extractors such as ResNet50 and MobileNetV2. On the KITTI 3D monocular detection benchmark, NextBEV achieves a 2.39% accuracy improvement while using fewer than 10% of MobileNetV3's parameters. Moreover, we propose modifications to LiDAR backbones that reduce the original inference time to 10 ms. Additionally, by fusing these lightweight proposals, we improve the accuracy of the VoxelNet-based model by 2.93% and the F1-score of the PointPillar-based model by approximately 20%. This work therefore contributes lightweight yet powerful models for standalone or fusion pipelines, making them better suited to onboard deployment.