Multi-view 3D object detection (MV3D-Det) in Bird's-Eye-View (BEV) has drawn extensive attention due to its low cost and high efficiency. Although new algorithms for camera-only 3D object detection are continuously being proposed, most of them risk drastic performance degradation when the domain of the input images differs from that of the training data. In this paper, we first analyze the causes of the domain gap for the MV3D-Det task. Based on the covariate shift assumption, we find that the gap is mainly attributable to the BEV feature distribution, which is determined by the quality of both the depth estimation and the 2D image feature representation. To acquire a robust depth prediction, we propose to decouple the depth estimation from the intrinsic parameters of the camera (i.e., the focal length) by converting the prediction of metric depth into that of scale-invariant depth, and we perform dynamic perspective augmentation, which uses homography to increase the diversity of the extrinsic parameters (i.e., the camera poses). Moreover, we modify the focal length values to create multiple pseudo-domains and construct an adversarial training loss to encourage the feature representation to be more domain-agnostic. Without bells and whistles, our approach, namely DG-BEV, successfully alleviates the performance drop on the unseen target domain without impairing the accuracy on the source domain. Extensive experiments on various public datasets, including Waymo, nuScenes, and Lyft, demonstrate the generalization and effectiveness of our approach. To the best of our knowledge, this is the first systematic study to explore a domain generalization method for MV3D-Det.
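The two geometric ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the DG-BEV implementation. The focal-length normalization `d_s = d_m * ref_focal_px / focal_px` is an assumed form chosen for illustration (the abstract does not give the formula), and the function names and the `ref_focal_px` parameter are hypothetical. The homography for a pure camera rotation, `H = K R K^-1`, is the standard textbook result; how the augmentation is actually sampled and applied in the method is not stated in the abstract.

```python
import numpy as np


def metric_to_scale_invariant(depth_m, focal_px, ref_focal_px=1000.0):
    """Map metric depth to a focal-normalized depth.

    Assumed form (not taken from the paper): d_s = d_m * ref_focal_px / focal_px.
    An object at depth d seen with focal length f has image size proportional
    to f / d, so d / f is what the image appearance actually constrains.
    """
    return np.asarray(depth_m, dtype=float) * ref_focal_px / focal_px


def scale_invariant_to_metric(depth_s, focal_px, ref_focal_px=1000.0):
    """Inverse of metric_to_scale_invariant for a camera with focal_px."""
    return np.asarray(depth_s, dtype=float) * focal_px / ref_focal_px


def rotation_homography(K, R):
    """Homography induced by a pure camera rotation R: H = K R K^-1."""
    return K @ R @ np.linalg.inv(K)


def warp_points(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of pixel coordinates."""
    pts = np.asarray(pts, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]  # back to pixel coords


def pitch_rotation(theta):
    """Rotation about the camera x-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])


if __name__ == "__main__":
    # Two cameras whose images show the object at the same pixel size:
    # 20 m at f = 1000 px and 40 m at f = 2000 px share one normalized depth.
    print(metric_to_scale_invariant(20.0, 1000.0))
    print(metric_to_scale_invariant(40.0, 2000.0))

    # Hypothetical intrinsics; a small pitch change shifts the principal point.
    K = np.array([[1000.0, 0.0, 800.0],
                  [0.0, 1000.0, 450.0],
                  [0.0, 0.0, 1.0]])
    H = rotation_homography(K, pitch_rotation(np.deg2rad(2.0)))
    print(warp_points(H, [[800.0, 450.0]]))
```

Under this assumed normalization, a depth head trained on the normalized target would see the same value for the same image appearance regardless of focal length, and the metric depth is recovered at inference time from the known intrinsics. Sampling small random rotations and warping the image with the resulting homography is one plausible way to vary the apparent camera pose without collecting new data.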