Monocular 3D object detection aims to localize and identify objects in 3D from a single-view image. Despite recent progress, it often struggles with pervasive object occlusions, which complicate and degrade the prediction of object dimensions, depths, and orientations. We design MonoMAE, a monocular 3D detector inspired by Masked Autoencoders that addresses object occlusion by masking and reconstructing objects in the feature space. MonoMAE consists of two novel designs. The first is depth-aware masking, which selectively masks parts of non-occluded object queries in the feature space to simulate occluded object queries for network training. It balances the masked and preserved portions of each query adaptively according to depth information. The second is lightweight query completion, which works with the depth-aware masking to learn to reconstruct and complete the masked object queries. With the proposed object masking and completion, MonoMAE learns enriched 3D representations that achieve superior monocular 3D detection performance, both qualitatively and quantitatively, for occluded and non-occluded objects alike. Additionally, MonoMAE learns generalizable representations that work well in new domains.
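To make the depth-aware masking idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an object query is a flat feature vector, that masking means zeroing feature channels, and that the masked fraction falls linearly with depth so that nearer objects are masked more heavily. The abstract states only that the balance depends on depth, so the linear rule, its direction, and the names `mask_ratio`, `depth_aware_mask`, `lo`, `hi`, and `max_depth` are all hypothetical.

```python
import random


def mask_ratio(depth, max_depth, lo=0.1, hi=0.9):
    """Hypothetical rule (an assumption, not taken from the paper):
    the masked fraction decreases linearly from `hi` to `lo` as depth grows."""
    normalized = min(max(depth / max_depth, 0.0), 1.0)  # clamp to [0, 1]
    closeness = 1.0 - normalized
    return lo + (hi - lo) * closeness


def depth_aware_mask(query, depth, max_depth, rng):
    """Zero out a depth-dependent fraction of a query's feature channels.

    Returns the masked query and a 0/1 mask (1 = preserved channel).
    """
    n = len(query)
    n_masked = round(mask_ratio(depth, max_depth) * n)
    masked_idx = set(rng.sample(range(n), n_masked))
    mask = [0 if i in masked_idx else 1 for i in range(n)]
    masked = [q * m for q, m in zip(query, mask)]
    return masked, mask


rng = random.Random(0)           # fixed seed so the sketch is reproducible
query = [1.0] * 10               # toy non-occluded object query

near_masked, near_mask = depth_aware_mask(query, depth=5.0, max_depth=50.0, rng=rng)
far_masked, far_mask = depth_aware_mask(query, depth=45.0, max_depth=50.0, rng=rng)

print("near object keeps", sum(near_mask), "of", len(query), "channels")
print("far object keeps", sum(far_mask), "of", len(query), "channels")
```

In a training pipeline, the masked queries would be passed to the completion module, which is trained to recover the unmasked features. That module is learned and is therefore left out of this sketch.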