Monocular 3D detectors achieve remarkable performance on cars and smaller objects. However, their performance drops on larger objects, which can lead to fatal accidents. Some attribute these failures to training data scarcity or to the receptive field requirements of large objects. In this paper, we highlight this understudied problem of generalization to large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to noise for larger objects. To bridge this gap, we comprehensively investigate regression and dice losses, examining their robustness under varying error levels and object sizes. For a simplified case, we mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses. Leveraging our theoretical insights, we propose SeaBird (Segmentation in Bird's View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection, with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI-360 leaderboard and improves existing detectors on the nuScenes leaderboard, particularly for large objects. Code and models are available at https://github.com/abhi1kumar/SeaBird
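As a rough illustration of the dice loss mentioned above, the following minimal sketch evaluates a soft dice loss on a toy 1D BEV occupancy row. It is not taken from the SeaBird code: the helper names `dice_loss` and `segment_mask`, the grid size, and the 2-cell shift are all assumptions chosen for illustration. In this toy setting, a fixed localization shift costs less dice loss as the object gets longer, whereas an L1 penalty on the same center offset would not depend on object size.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss: 1 - 2*|P ∩ G| / (|P| + |G|). Illustrative helper."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = np.sum(pred * target)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

def segment_mask(start, length, size=200):
    """Binary occupancy of one object along a 1D BEV row (hypothetical toy setup)."""
    mask = np.zeros(size)
    mask[start:start + length] = 1.0
    return mask

shift = 2  # assumed localization noise, in grid cells
for length in (4, 12, 40):  # small to large objects
    gt = segment_mask(50, length)
    pred = segment_mask(50 + shift, length)
    # For shift <= length, the dice loss here is approximately shift / length.
    print(f"length={length:2d}  dice_loss={dice_loss(pred, gt):.3f}")
```

This toy comparison only conveys the intuition; the formal robustness and convergence analysis is the one given in the paper.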