Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse temporal frames in a parallel manner. While parallel fusion can benefit from long-term information, its computational and memory overheads grow with the size of the fusion window. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be integrated efficiently, yet it fails to benefit from longer temporal frames. In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains leading performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80m minADE and 0.463 EPA). Code will be made available.
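To make the contrast between parallel and recurrent fusion concrete, the following is a minimal sketch and not the VideoBEV implementation. It assumes BEV features are already aligned to the current ego frame, and it substitutes a fixed exponential blend (the hypothetical `fuse` function with weight `alpha`) for whatever learned fusion operator a real model would use. The function names `recurrent_fusion` and `parallel_fusion` are likewise illustrative.

```python
import numpy as np


def fuse(prev_state, current, alpha=0.5):
    """Hypothetical fusion operator: blend the carried state with the current frame.

    A real model would use learned layers here; the fixed blend is an assumption.
    """
    if prev_state is None:
        # First frame: there is no history to fuse with.
        return current.copy()
    return alpha * prev_state + (1.0 - alpha) * current


def recurrent_fusion(frames, alpha=0.5):
    """Recurrent fusion: only one fused BEV feature map is carried across time."""
    state = None
    for bev in frames:
        state = fuse(state, bev, alpha)
    return state


def parallel_fusion(frames, window):
    """Parallel fusion: the last `window` frames are held in memory and fused at once."""
    recent = frames[-window:]
    return np.mean(np.stack(recent), axis=0)


if __name__ == "__main__":
    # Four toy BEV feature maps of shape (channels, height, width).
    frames = [np.full((2, 4, 4), float(t)) for t in range(4)]
    print(recurrent_fusion(frames).shape)
    print(parallel_fusion(frames, window=2).shape)
```

In this sketch the recurrent variant stores a single state regardless of how many frames have been seen, whereas the parallel variant must keep `window` feature maps, which is the overhead the abstract refers to.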