Autonomous driving requires an accurate representation of the environment. One strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to compensate for the lack of explicit depth and velocity measurements. Hence, the effectiveness of a BEV encoder depends crucially on the operators used to aggregate temporal information and on the latent representation spaces in which aggregation is performed. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We treat subsequent image frames as stereo through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. Our ablation study uncovers a strong synergy between temporal aggregation in the image and BEV latent spaces. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.
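To make the core idea concrete, the following is a minimal sketch of joint temporal aggregation in two latent spaces. It is not the TempBEV architecture itself: the exponential-decay weighting, the shared feature shapes, and the channel-wise concatenation are all simplifying stand-ins for the learned aggregation operators, view transform, and fusion described in the paper.

```python
import numpy as np

def aggregate_temporal(features, decay=0.5):
    """Weighted average of per-frame feature maps, newer frames weighted
    higher (hypothetical stand-in for a learned aggregation operator)."""
    T = len(features)
    weights = np.array([decay ** (T - 1 - t) for t in range(T)])
    weights /= weights.sum()
    return sum(w * f for w, f in zip(weights, features))

def fuse_latent_spaces(image_feats, bev_feats):
    """Aggregate temporal information separately in the image and BEV
    latent spaces, then concatenate along the channel axis (stand-in
    for the learned fusion; real image and BEV features would first be
    brought into a common spatial layout by the view transform)."""
    img_agg = aggregate_temporal(image_feats)  # image latent space
    bev_agg = aggregate_temporal(bev_feats)    # BEV latent space
    return np.concatenate([img_agg, bev_agg], axis=0)

# Toy example: 3 time steps of 4-channel 8x8 feature maps per space.
rng = np.random.default_rng(0)
image_feats = [rng.standard_normal((4, 8, 8)) for _ in range(3)]
bev_feats = [rng.standard_normal((4, 8, 8)) for _ in range(3)]
fused = fuse_latent_spaces(image_feats, bev_feats)
print(fused.shape)  # (8, 8, 8): both aggregated spaces stacked channel-wise
```

The point of the sketch is only the structure: temporal information is aggregated in each latent space independently and both results are handed to the downstream head, rather than aggregating in one space alone.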