3D object detection from roadside cameras complements vehicle-mounted perception in autonomous driving, alleviating the occlusion and short perception range of vehicle cameras. Previous methods for roadside 3D object detection mainly focus on modeling the depth or height of objects, neglecting the stationarity of the cameras and the resulting inter-frame consistency. In this work, we propose a novel framework, namely MOSE, for MOnocular 3D object detection with Scene cuEs. Scene cues are frame-invariant, scene-specific features that are crucial for object localization and can be intuitively regarded as the height between the real road surface and the virtual ground plane. In the proposed framework, a scene cue bank aggregates scene cues from multiple frames of the same scene, supported by a carefully designed extrinsic augmentation strategy. A transformer-based decoder then lifts the aggregated scene cues together with the 3D position embeddings for 3D object localization, which boosts generalization ability in heterogeneous scenes. Extensive experimental results on two public benchmarks demonstrate the state-of-the-art performance of the proposed method, which surpasses existing methods by a large margin.
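To illustrate why the height between the real road surface and the virtual ground plane matters for localization, the following is a minimal geometric sketch, not the MOSE implementation. It assumes a ground-aligned world frame in which the virtual ground plane is z = 0, and a viewing ray already expressed in that frame; the function name `intersect_ray_with_height` and all numeric values are hypothetical and chosen only for illustration.

```python
def intersect_ray_with_height(cam_center, ray_dir, height):
    """Intersect a viewing ray with the horizontal plane z = height.

    cam_center: (x, y, z) of the camera in a ground-aligned frame
                where the virtual ground plane is z = 0 (assumption).
    ray_dir:    (dx, dy, dz) direction of the pixel's viewing ray.
    height:     height of the real road surface above the virtual plane.
    """
    cx, cy, cz = cam_center
    dx, dy, dz = ray_dir
    if abs(dz) < 1e-9:
        raise ValueError("ray is parallel to the ground plane")
    # Solve cz + t * dz = height for the ray parameter t.
    t = (height - cz) / dz
    if t <= 0:
        raise ValueError("plane lies behind the camera along this ray")
    return (cx + t * dx, cy + t * dy, cz + t * dz)


# Hypothetical roadside camera mounted 6 m above the virtual ground plane,
# looking along +y with a shallow downward ray.
camera = (0.0, 0.0, 6.0)
ray = (0.0, 1.0, -0.125)

on_virtual_plane = intersect_ray_with_height(camera, ray, 0.0)
on_real_road = intersect_ray_with_height(camera, ray, 0.5)

# Longitudinal error caused by ignoring the 0.5 m road height.
range_error = on_virtual_plane[1] - on_real_road[1]
```

With a shallow viewing angle, a small height offset translates into a much larger error along the road, which is one way to see why a scene-specific, frame-invariant height cue is useful for a fixed roadside camera.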