Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction

In this paper, we propose a new paradigm, named Historical Object Prediction (HoP), for multi-view 3D detection that leverages temporal information more effectively. The HoP approach is straightforward: given the current timestamp t, we generate a pseudo Bird's-Eye View (BEV) feature for timestamp t-k from its adjacent frames and use this feature to predict the object set at timestamp t-k. Our approach is motivated by the observation that forcing the detector to capture both the spatial location and the temporal motion of objects at historical timestamps can lead to more accurate BEV feature learning. First, we design short-term and long-term temporal decoders that generate the pseudo BEV feature for timestamp t-k without using its corresponding camera images. Second, an additional object decoder is attached to predict the object targets from the generated pseudo BEV feature. Because HoP is performed only during training, the proposed method introduces no extra overhead at inference. As a plug-and-play approach, HoP can be easily incorporated into state-of-the-art BEV detection frameworks, including BEVFormer and the BEVDet series. Furthermore, as an auxiliary task, HoP is complementary to prevalent temporal modeling methods, leading to significant performance gains. We evaluate HoP with extensive experiments on the nuScenes dataset, using two representative methods, BEVFormer and BEVDet4D-Depth. Notably, HoP achieves 68.5% NDS and 62.4% mAP with ViT-L on the nuScenes test set, outperforming all the 3D object detectors on the leaderboard. Code will be available at https://github.com/Sense-X/HoP.
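To make the training-time data flow concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It shows the one property the abstract states: the BEV feature of timestamp t-k is reconstructed from other frames only and is never read directly. Everything else is an illustrative assumption: the names `split_history`, `pseudo_bev`, and `training_loss`, the rule that "short-term" means frames within one step of t-k and "long-term" means all remaining frames, the use of simple averaging in place of the learned temporal decoders, and the loss weight `hop_weight`.

```python
from typing import Dict, List, Sequence


def split_history(num_frames: int, k: int, short_window: int = 1) -> Dict[str, List[int]]:
    """Choose the frames fed to each (hypothetical) temporal decoder.

    Frames are indexed by offset from the current timestamp t:
    offset 0 is t, offset 1 is t-1, and so on. The frame at offset k
    is the prediction target, so it is excluded from both groups.
    """
    if not 0 < k < num_frames - 1:
        raise ValueError("t-k needs at least one neighbour on each side")
    remaining = [i for i in range(num_frames) if i != k]
    # Assumption: short-term = frames within short_window steps of t-k.
    short_term = [i for i in remaining if abs(i - k) <= short_window]
    # Assumption: long-term = every remaining frame in the history.
    long_term = remaining
    return {"short_term": short_term, "long_term": long_term}


def mean_feature(features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean; a stand-in for a learned temporal decoder."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def pseudo_bev(history: Sequence[Sequence[float]], k: int) -> List[float]:
    """Build a pseudo BEV feature for t-k without touching history[k]."""
    groups = split_history(len(history), k)
    short = mean_feature([history[i] for i in groups["short_term"]])
    long_ = mean_feature([history[i] for i in groups["long_term"]])
    # Assumption: the two decoder outputs are fused by averaging.
    return [(s + l) / 2 for s, l in zip(short, long_)]


def training_loss(det_loss: float, hop_loss: float, hop_weight: float = 0.5) -> float:
    """Main detection loss plus the auxiliary HoP loss (weight is assumed)."""
    return det_loss + hop_weight * hop_loss


# Toy flattened BEV features for timestamps t, t-1, t-2, t-3.
history = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
feature = pseudo_bev(history, k=1)
loss = training_loss(det_loss=1.0, hop_loss=2.0)
```

In this sketch the auxiliary branch only contributes a loss term, so at inference the `pseudo_bev` and `training_loss` calls are simply skipped and the base detector runs unchanged, which mirrors the abstract's claim that HoP adds no inference overhead.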
