Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse frames in parallel. While parallel fusion can benefit from long-term information, its computational and memory overheads grow with the fusion window size. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be integrated efficiently, yet it fails to benefit from longer frame sequences. In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already combines the merits of both approaches, i.e., rich long-term information and an efficient fusion pipeline. We further propose a temporal embedding module to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains leading performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80m minADE and 0.463 EPA). Code will be available.
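To make the contrast with parallel fusion concrete, the following is a minimal sketch of recurrent BEV fusion with a time-gap embedding. It is not the VideoBEV implementation: the function names `temporal_embedding` and `recurrent_fuse`, the sinusoidal form of the embedding, the `tanh` update, and the mixing matrices `w_cur` and `w_hist` (standing in for learned layers) are all assumptions made for illustration. Ego-motion alignment of the history feature, which a real pipeline would need, is omitted.

```python
import numpy as np


def temporal_embedding(dt, channels):
    """Hypothetical sinusoidal encoding of the time gap dt (seconds).

    Assumes `channels` is even. Returns an array of shape (channels,).
    """
    half = channels // 2
    freqs = 1.0 / (10.0 ** (np.arange(half) / half))
    angles = dt * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def recurrent_fuse(frames, timestamps, w_cur, w_hist):
    """Fuse BEV feature maps one frame at a time with a single history state.

    frames: list of arrays shaped (C, H, W).
    timestamps: capture time of each frame in seconds.
    w_cur, w_hist: (C, C) matrices standing in for learned 1x1 convolutions.
    Memory stays at one (C, H, W) state regardless of how many frames are fused.
    """
    state = np.zeros_like(frames[0])
    prev_t = timestamps[0]
    for feat, t in zip(frames, timestamps):
        # Tell the update how old the history is, so a dropped frame
        # (a larger gap) is visible to the fusion step.
        emb = temporal_embedding(t - prev_t, feat.shape[0])
        hist = state + emb[:, None, None]
        mixed = np.einsum("oc,chw->ohw", w_cur, feat)
        mixed += np.einsum("oc,chw->ohw", w_hist, hist)
        state = np.tanh(mixed)
        prev_t = t
    return state


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W = 4, 3, 3
    frames = [rng.standard_normal((C, H, W)) for _ in range(5)]
    # The gap between the third and fourth timestamps simulates a missed frame.
    stamps = [0.0, 0.5, 1.0, 2.0, 2.5]
    w = 0.5 * np.eye(C)
    print(recurrent_fuse(frames, stamps, w, w).shape)
```

Under these assumptions, each new frame costs one update against a fixed-size state, whereas a parallel scheme would have to hold and process every frame in the fusion window at once.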