AsyncBEV: Cross-modal Flow Alignment in Asynchronous 3D Object Detection

In autonomous driving, multi-modal perception tasks such as 3D object detection typically rely on well-synchronized sensors, both at training and at inference. However, despite the use of hardware- or software-based synchronization algorithms, perfect synchrony is rarely guaranteed: sensors may operate at different frequencies, and real-world factors such as network latency, hardware failures, or processing bottlenecks often introduce time offsets between sensors. Such asynchrony degrades perception performance, especially for dynamic objects. To address this challenge, we propose AsyncBEV, a lightweight, trainable, and generic module that improves the robustness of 3D Bird's-Eye-View (BEV) object detection models against sensor asynchrony. Inspired by scene flow estimation, AsyncBEV first estimates the 2D flow from the BEV features of two different sensor modalities, taking into account the known time offset between these sensor measurements. The predicted feature flow is then used to warp and spatially align the feature maps; we show that the module can easily be integrated into different current BEV detector architectures (e.g., BEV grid-based and token-based). Extensive experiments demonstrate that AsyncBEV improves robustness against both small and large asynchrony between LiDAR and camera sensors in both the token-based CMT and the grid-based UniBEV, especially for dynamic objects. We significantly outperform the ego-motion-compensated CMT and UniBEV baselines, notably by $16.6\%$ and $11.9\%$ NDS on dynamic objects, respectively, in the worst-case scenario of a $0.5\,\mathrm{s}$ time offset. Code will be released upon acceptance.
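
The warping step described above can be illustrated with a minimal sketch. This is not the AsyncBEV implementation: the abstract does not specify the warping direction or the interpolation scheme, so the sketch assumes backward warping with bilinear interpolation on a single BEV channel. The function names (`bilinear_sample`, `warp_bev`, `constant_velocity_flow`) are hypothetical, and the learned flow predictor is replaced by a constant-velocity stand-in that converts a known time offset into a per-cell displacement.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2D grid `feat` (list of rows) at continuous (x, y); zero outside."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    total = 0.0
    # Accumulate the four neighbouring cells, skipping those outside the grid.
    for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= yy < h and 0 <= xx < w:
                total += wy * wx * feat[yy][xx]
    return total


def warp_bev(feat, flow):
    """Backward-warp one BEV channel: out[y][x] = feat sampled at (x + dx, y + dy)."""
    return [
        [bilinear_sample(feat, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(len(feat[0]))]
        for y in range(len(feat))
    ]


def constant_velocity_flow(h, w, vx, vy, dt, cell_size):
    """Hypothetical stand-in for the learned flow (assumption, not from the paper).

    An object moving at (vx, vy) m/s for dt seconds ends up v * dt further along,
    so each aligned cell must look *back* by that amount in the stale map.
    """
    dx, dy = -vx * dt / cell_size, -vy * dt / cell_size
    return [[(dx, dy) for _ in range(w)] for _ in range(h)]


if __name__ == "__main__":
    # Stale feature map (e.g. from the delayed sensor) with one active cell.
    stale = [[0.0] * 6 for _ in range(4)]
    stale[1][1] = 1.0
    # 4 m/s along +x, 0.5 s offset, 1 m cells: the object has moved 2 cells.
    flow = constant_velocity_flow(4, 6, vx=4.0, vy=0.0, dt=0.5, cell_size=1.0)
    aligned = warp_bev(stale, flow)
    print(aligned[1])  # the activation now sits at column 3
```

In a real detector the flow would vary per cell and be predicted from the two modalities' BEV features together with the time offset, and the sampling would be done by a vectorized tensor operation rather than Python loops.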
