Recent multi-camera 3D object detectors usually leverage temporal information to construct multi-view stereo, which alleviates the ill-posed depth estimation problem. However, they typically assume that all objects are static and directly aggregate features across frames. This work begins with a theoretical and empirical analysis revealing that ignoring the motion of moving objects can result in serious localization bias. We therefore propose to model Dynamic Objects in RecurrenT (DORT) to tackle this problem. In contrast to previous global Bird's-Eye-View (BEV) methods, DORT extracts object-wise local volumes for motion estimation, which also alleviates the heavy computational burden. By iteratively refining the estimated object motion and location, features from preceding frames can be precisely aggregated to the current frame, mitigating the aforementioned adverse effects. This simple framework has two appealing properties. First, it is flexible and practical, and can be plugged into most camera-based 3D object detectors. Second, because object motion is predicted in the loop, it can easily track objects across frames according to their nearest center distances. Without bells and whistles, DORT outperforms all previous methods on the nuScenes detection and tracking benchmarks, with 62.5\% NDS and 57.6\% AMOTA, respectively. The source code will be released.
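The tracking step mentioned in the abstract (associating objects across frames by nearest center distance, using the predicted motion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `associate_by_center`, the greedy matching strategy, the 2D BEV center representation, and the `max_dist` threshold are all assumptions chosen for illustration.

```python
import numpy as np


def associate_by_center(prev_centers, prev_velocities, curr_centers, dt, max_dist=2.0):
    """Greedy nearest-center matching between two frames (hypothetical helper).

    prev_centers:    (M, 2) BEV object centers in the previous frame, in meters
    prev_velocities: (M, 2) estimated BEV velocities of those objects, in m/s
    curr_centers:    (N, 2) BEV object centers detected in the current frame
    dt:              time gap between the two frames, in seconds
    max_dist:        assumed gating threshold; pairs farther apart are not matched
    Returns a list of (prev_index, curr_index) pairs.
    """
    prev_centers = np.asarray(prev_centers, dtype=float).reshape(-1, 2)
    prev_velocities = np.asarray(prev_velocities, dtype=float).reshape(-1, 2)
    curr_centers = np.asarray(curr_centers, dtype=float).reshape(-1, 2)

    # Motion compensation: move each previous center to where it should be now.
    predicted = prev_centers + prev_velocities * dt

    # Pairwise Euclidean distances, shape (M, N).
    dists = np.linalg.norm(predicted[:, None, :] - curr_centers[None, :, :], axis=-1)

    matches = []
    used_prev, used_curr = set(), set()
    # Visit candidate pairs from closest to farthest and keep the first
    # unused pairing for each object.
    for flat_index in np.argsort(dists, axis=None):
        i, j = (int(k) for k in np.unravel_index(flat_index, dists.shape))
        if dists[i, j] > max_dist:
            break  # all remaining pairs are even farther apart
        if i in used_prev or j in used_curr:
            continue
        matches.append((i, j))
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

As a usage example, take a car at the origin moving at 5 m/s along x and a parked car at x = 10 m, observed again 0.5 s later. With the estimated velocities, both objects are matched; with all velocities set to zero (the static assumption), the moving car's predicted center falls outside the gate and its match is lost, a small-scale analogue of the bias that arises when object motion is ignored.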