Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors (such as camera with lidar or camera with radar) by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we propose HRFuser, a modular architecture for multi-modal 2D object detection. It fuses multiple sensors in a multi-resolution fashion and scales to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block to fuse multiple modalities at multiple resolutions. We demonstrate via extensive experiments on nuScenes and the adverse-condition DENSE datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art 3D and 2D fusion methods evaluated on 2D object detection metrics. The source code is publicly available.
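To make the idea of window-based cross-attention fusion concrete, the following is a minimal NumPy sketch and not the HRFuser implementation. It assumes that camera features act as queries and each additional modality supplies keys and values within non-overlapping local windows, that all feature maps share one resolution, and that learned projections and multiple heads are omitted. The function names (`to_windows`, `from_windows`, `window_cross_attention`, `fuse`) and the window size are hypothetical.

```python
import numpy as np

def to_windows(x, w):
    """Split an (H, W, C) map into (num_windows, w*w, C) non-overlapping windows."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def from_windows(win, w, H, W):
    """Inverse of to_windows: reassemble windows into an (H, W, C) map."""
    C = win.shape[-1]
    x = win.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_attention(camera, other, w):
    """Cross-attention inside each local window.

    Assumption: queries come from the camera features and keys/values from
    the other modality, with identity projections (no learned weights).
    """
    H, W, C = camera.shape
    q = to_windows(camera, w)                          # (nw, w*w, C)
    kv = to_windows(other, w)                          # (nw, w*w, C)
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(C)    # (nw, w*w, w*w)
    attended = softmax(scores) @ kv                    # (nw, w*w, C)
    return from_windows(attended, w, H, W)

def fuse(camera, others, w=4):
    """Add the attended features of every extra modality to the camera map.

    Looping over `others` is what lets this sketch accept an arbitrary
    number of input modalities.
    """
    fused = camera.copy()
    for feat in others:
        fused = fused + window_cross_attention(camera, feat, w)
    return fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cam = rng.normal(size=(8, 8, 16))
    lidar = rng.normal(size=(8, 8, 16))
    radar = rng.normal(size=(8, 8, 16))
    print(fuse(cam, [lidar, radar]).shape)
```

In a multi-resolution design, a function like `fuse` would be applied separately at each resolution branch of the backbone; how HRFuser actually wires these blocks is described in the paper body and the released code, not in this sketch.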