Effective bird's-eye-view (BEV) object detection on infrastructure can greatly improve traffic scene understanding and vehicle-to-infrastructure (V2I) cooperative perception. However, cameras installed on infrastructure are mounted in a wide variety of poses, and previous BEV detection methods rely on accurate calibration, which is difficult to maintain in practice because of unavoidable natural factors (e.g., wind and snow). In this paper, we propose a Calibration-free BEV Representation (CBR) network, which achieves 3D detection based on BEV representation without calibration parameters or additional depth supervision. Specifically, we use two multi-layer perceptrons to decouple perspective-view features into front-view and bird's-eye-view features under box-induced foreground supervision. Then, a cross-view feature fusion module matches features from the two orthogonal views according to their similarity and enhances the BEV features with front-view features. Experimental results on DAIR-V2X demonstrate that CBR achieves acceptable performance without any camera parameters and is inherently unaffected by calibration noise. We hope CBR can serve as a baseline for future research addressing the practical challenges of infrastructure perception.
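To make the two steps described in the abstract concrete, the following is a minimal, dependency-free sketch of view decoupling with two MLPs followed by similarity-based cross-view fusion. It is not the CBR implementation: the abstract does not specify the layer sizes, how the MLPs are applied, or the exact similarity measure. The sketch assumes each MLP acts per channel over flattened spatial positions, that similarity is a scaled dot product normalized with softmax, and that the front-view context is added residually to the BEV features. All function names (`make_mlp`, `decouple_view`, `fuse_bev_with_front`) and all dimensions are hypothetical, and the box-induced foreground supervision is omitted.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a two-layer perceptron (sizes are illustrative)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def matvec(weights, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def mlp_forward(mlp, vector):
    """Linear layer, ReLU, linear layer."""
    w1, w2 = mlp
    hidden = [max(0.0, h) for h in matvec(w1, vector)]
    return matvec(w2, hidden)


def decouple_view(mlp, pv_feat):
    """Map perspective-view features [C][H*W] to a target view [C][N].

    Assumption: the MLP is shared across channels and mixes spatial
    positions, so no camera parameters are needed for the mapping.
    """
    return [mlp_forward(mlp, channel) for channel in pv_feat]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_cells(feat):
    """Transpose channel-major [C][N] features to cell-major [N][C]."""
    return [list(cell) for cell in zip(*feat)]


def fuse_bev_with_front(bev_feat, front_feat):
    """Enhance each BEV cell with similarity-weighted front-view features.

    Assumption: scaled dot-product similarity plus a residual connection.
    Returns cell-major features of shape [N_bev][C].
    """
    bev_cells = to_cells(bev_feat)
    front_cells = to_cells(front_feat)
    scale = math.sqrt(len(bev_feat))
    fused = []
    for query in bev_cells:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in front_cells]
        weights = softmax(scores)
        context = [sum(w * key[c] for w, key in zip(weights, front_cells))
                   for c in range(len(query))]
        fused.append([q + ctx for q, ctx in zip(query, context)])
    return fused


rng = random.Random(0)
C, H, W = 4, 3, 5           # hypothetical channel count and feature-map size
N_FRONT, N_BEV = 6, 8       # hypothetical number of cells in each target view

pv_feat = [[rng.random() for _ in range(H * W)] for _ in range(C)]
to_front = make_mlp(H * W, 16, N_FRONT, rng)
to_bev = make_mlp(H * W, 16, N_BEV, rng)

front_feat = decouple_view(to_front, pv_feat)
bev_feat = decouple_view(to_bev, pv_feat)
enhanced_bev = fuse_bev_with_front(bev_feat, front_feat)
```

Because both view mappings in this sketch are learned rather than computed from camera intrinsics or extrinsics, perturbing the calibration has no input through which to affect the result, which is the property the abstract highlights.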