Monocular 3D object detection is a crucial and challenging task for autonomous driving, as it uses only a single camera image to infer the 3D objects in a scene. To address the difficulty of predicting depth from pictorial cues alone, we propose a novel perspective-aware convolutional layer that captures long-range dependencies in images. By constraining convolutional kernels to extract features along the depth axis of every image pixel, we incorporate perspective information into the network architecture. We integrate our perspective-aware convolutional layer into a 3D object detector and demonstrate improved performance on the KITTI3D dataset, achieving 23.9% average precision on the easy benchmark. These results underscore the importance of modeling scene cues for accurate depth inference and highlight the benefits of incorporating scene structure into network design. Our perspective-aware convolutional layer has the potential to enhance object detection accuracy by providing more precise and context-aware feature extraction.
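The idea of extracting features along the depth axis of each pixel can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a pinhole camera whose depth axis coincides with the optical axis, so that a 3D point moving away from the camera projects along the image line joining its pixel to the principal point `(cx, cy)`. The function names, the 1D kernel, the `dilation` parameter, and the vertical fallback at the principal point are all hypothetical choices made for illustration.

```python
import math


def depth_axis_offsets(u, v, cx, cy, kernel_size=3, dilation=1.0):
    """Sampling offsets (du, dv) along the image-space depth axis at pixel (u, v).

    Assumption: under a pinhole model, the depth axis through a pixel projects
    onto the line from that pixel toward the principal point (cx, cy).
    """
    dx, dy = cx - u, cy - v
    norm = math.hypot(dx, dy)
    if norm < 1e-6:
        # Direction is undefined at the principal point; fall back to the
        # vertical image direction (an arbitrary choice for this sketch).
        dx, dy = 0.0, -1.0
    else:
        dx, dy = dx / norm, dy / norm
    half = kernel_size // 2
    return [(i * dilation * dx, i * dilation * dy) for i in range(-half, half + 1)]


def bilinear(feat, x, y):
    """Bilinearly sample a 2D feature map feat[row][col]; zero outside the map."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xx < w and 0 <= yy < h:
                value += wx * wy * feat[yy][xx]
    return value


def perspective_conv_at(feat, u, v, cx, cy, weights, dilation=1.0):
    """Apply a 1D kernel at (u, v), with taps placed along the depth axis."""
    offsets = depth_axis_offsets(u, v, cx, cy, len(weights), dilation)
    return sum(
        w * bilinear(feat, u + du, v + dv) for w, (du, dv) in zip(weights, offsets)
    )


# Usage: a 5x5 horizontal ramp, evaluated at a pixel left of the principal point.
ramp = [[float(col) for col in range(5)] for _ in range(5)]
response = perspective_conv_at(ramp, u=1, v=2, cx=3, cy=2, weights=[-1.0, 0.0, 1.0])
```

Unlike a standard convolution, whose taps lie on a fixed grid, the tap positions here change with each pixel's location relative to the principal point. A larger `dilation` spreads the taps further along the same line, which is one way a kernel of this kind could reach long-range context.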