We propose a unified object-aware temporal learning framework for multi-view 3D detection and tracking. We observe that the temporal fusion strategy used in recent multi-view perception methods can be weakened by distractors and background clutter in historical frames, and we propose a cyclic learning mechanism to make multi-view representation learning more robust. The core idea is a backward bridge that propagates information from model predictions (e.g., object locations and sizes) back to the image and BEV features, closing a loop with the regular forward inference pass. After this backward refinement, the responses of target-irrelevant regions in historical frames are suppressed, which lowers the risk of polluting future frames and makes temporal fusion more object-aware. Building on the cyclic learning model, we further tailor an object-aware association strategy for tracking. The cyclic learning model provides not only refined features but also finer cues (e.g., scale level) for tracklet association. Together, the proposed cyclic learning method and association module form a novel, unified multi-task framework. Experiments on nuScenes show that the proposed model achieves consistent performance gains over baselines of different designs (i.e., the dense query-based BEVFormer, the sparse query-based SparseBEV, and the LSS-based BEVDet4D) on both detection and tracking evaluation.
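The backward refinement idea can be illustrated with a minimal sketch. The abstract does not specify how the backward bridge is implemented, so everything below is an assumption made for illustration: the function names (`object_mask`, `refine_bev`, `temporal_fuse`), the axis-aligned box format `(cx, cy, w, l)` in BEV grid cells, the fixed `background_weight`, and the simple weighted-sum fusion are all hypothetical and are not the authors' method. The sketch only shows the general pattern of using predicted boxes to damp target-irrelevant BEV responses in a historical frame before it is fused with the current frame.

```python
import numpy as np


def object_mask(boxes, grid_hw, background_weight=0.1):
    """Rasterize predicted boxes into a soft BEV mask (hypothetical helper).

    boxes: array of shape (N, 4) holding (cx, cy, w, l) in grid cells,
           treated as axis-aligned for simplicity (an assumption).
    grid_hw: (H, W) size of the BEV grid.
    Cells inside any box get weight 1.0; all others get background_weight.
    """
    h, w = grid_hw
    mask = np.full((h, w), background_weight, dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for cx, cy, bw, bl in np.asarray(boxes, dtype=np.float32).reshape(-1, 4):
        inside = (np.abs(xs - cx) <= bw / 2) & (np.abs(ys - cy) <= bl / 2)
        mask[inside] = 1.0
    return mask


def refine_bev(bev_feat, boxes, background_weight=0.1):
    """Suppress responses outside predicted objects.

    bev_feat: array of shape (C, H, W).
    Returns an array of the same shape with background cells scaled down.
    """
    mask = object_mask(boxes, bev_feat.shape[1:], background_weight)
    return bev_feat * mask[None, :, :]


def temporal_fuse(current, refined_history, alpha=0.5):
    """Blend current features with refined historical features.

    A plain weighted sum stands in for whatever fusion the baseline uses.
    """
    return alpha * current + (1.0 - alpha) * refined_history


# Usage: one historical frame with a single predicted object.
history = np.ones((2, 8, 8), dtype=np.float32)
current = np.ones((2, 8, 8), dtype=np.float32)
predicted_boxes = np.array([[4.0, 4.0, 2.0, 2.0]])

refined = refine_bev(history, predicted_boxes)
fused = temporal_fuse(current, refined)
```

In this toy setup the cells covered by the predicted box keep their full response while the rest of the historical frame is scaled down, so background clutter contributes less to the fused features. A learned refinement would replace the hard-coded mask and weight.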