Multi-view 3D object detection is becoming popular in autonomous driving because it is effective and low-cost. Most current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm, which benefits from both the strong perception power of BEV and an end-to-end pipeline. Despite substantial progress, existing works model objects by leveraging the temporal and spatial information of BEV features globally, which causes problems in complex and dynamic autonomous driving scenarios. In this paper, we propose OCBEV, an Object-Centric query-BEV detector that captures the temporal and spatial cues of moving targets more effectively. OCBEV comprises three designs. Object Aligned Temporal Fusion aligns the BEV feature based on ego-motion and the estimated current locations of moving objects, leading to precise instance-level feature fusion. Object Focused Multi-View Sampling samples more 3D features from adaptive local height ranges of objects in each scene to enrich foreground information. Object Informed Query Enhancement replaces part of the pre-defined decoder queries in common DETR-style decoders with positional features of objects at high-confidence locations, introducing more direct object positional priors. Extensive experiments are conducted on the challenging nuScenes dataset. Our approach achieves a state-of-the-art result, surpassing the traditional BEVFormer by 1.5 NDS points. Moreover, OCBEV converges faster, needing only half of the training iterations to reach comparable performance, which further demonstrates its effectiveness.
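To make the idea behind Object Informed Query Enhancement more concrete, the following is a minimal sketch of replacing part of a set of pre-defined decoder queries with positional features taken at high-confidence locations. It is an illustration under stated assumptions, not the OCBEV implementation: the BEV confidence heatmap, the threshold, the sinusoidal encoding, the choice to overwrite the first queries, and the names `top_k_locations`, `sine_position_feature`, and `enhance_queries` are all hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def top_k_locations(heatmap: List[List[float]], k: int,
                    threshold: float) -> List[Tuple[int, int]]:
    """Return up to k (row, col) cells scoring >= threshold, highest first."""
    scored = [
        (score, r, c)
        for r, row in enumerate(heatmap)
        for c, score in enumerate(row)
        if score >= threshold
    ]
    # Sort by descending score; break ties by position for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in scored[:k]]


def sine_position_feature(row: int, col: int, dim: int) -> Vector:
    """Assumed positional feature: sinusoidal encoding of a BEV cell.

    Half of the dimensions encode the row, the other half the column.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    quarter = dim // 4
    feature: Vector = []
    for coord in (row, col):
        for i in range(quarter):
            freq = 1.0 / (10000 ** (i / quarter))
            feature.append(math.sin(coord * freq))
            feature.append(math.cos(coord * freq))
    return feature


def enhance_queries(queries: List[Vector], heatmap: List[List[float]],
                    k: int, threshold: float = 0.5) -> List[Vector]:
    """Replace up to k queries with positional features of confident cells.

    Assumption: the first queries are the ones replaced; the rest keep
    their pre-defined values, so the total number of queries is unchanged.
    """
    dim = len(queries[0])
    locations = top_k_locations(heatmap, min(k, len(queries)), threshold)
    object_queries = [sine_position_feature(r, c, dim) for r, c in locations]
    return object_queries + queries[len(object_queries):]


if __name__ == "__main__":
    heatmap = [[0.1, 0.9, 0.2],
               [0.7, 0.3, 0.95]]
    queries = [[0.0] * 8 for _ in range(4)]
    enhanced = enhance_queries(queries, heatmap, k=2)
    print(len(enhanced), top_k_locations(heatmap, 2, 0.5))
```

In this sketch, the query count stays fixed and only confident cells contribute, so a scene with no cell above the threshold falls back to the original pre-defined queries unchanged.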