Unified monocular 3D object detection, covering both indoor and outdoor scenes, is important for applications such as robot navigation. However, training a single model on data from such varied scenarios is challenging because the data differ significantly in character, e.g., diverse geometric properties and heterogeneous domain distributions. To address these challenges, we build a detector on the bird's-eye-view (BEV) detection paradigm, whose explicit feature projection helps resolve the geometry learning ambiguity that arises when detectors are trained on data from multiple scenarios. We then split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by these challenges. Moreover, we develop a sparse BEV feature projection strategy to reduce computational cost and a unified domain alignment method to handle heterogeneous domains. Combining these techniques yields a unified detector, UniMODE, which surpasses the previous state of the art on the challenging Omni3D dataset (a large-scale dataset including both indoor and outdoor scenes) by 4.9% AP_3D, marking the first successful generalization of a BEV detector to unified 3D object detection.