Realizing unified monocular 3D object detection covering both indoor and outdoor scenes is of great importance in applications such as robot navigation. However, training a single model on data from such diverse scenarios is challenging because their characteristics differ significantly, e.g., in geometry properties and domain distributions. To address these challenges, we build a detector on the bird's-eye-view (BEV) detection paradigm, whose explicit feature projection helps resolve the geometry-learning ambiguity that arises when data from multiple scenarios are used for training. We then split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by these challenges. Moreover, we develop a sparse BEV feature projection strategy to reduce computational cost and a unified domain alignment method to handle heterogeneous domains. Combining these techniques yields a unified detector, UniMODE, which surpasses the previous state of the art on the challenging Omni3D dataset (a large-scale dataset covering both indoor and outdoor scenes) by 4.9% AP_3D, marking the first successful generalization of a BEV detector to unified 3D object detection.
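To give intuition for an uneven BEV grid, the sketch below generates depth bin edges whose widths grow geometrically with distance, so that near-range (indoor-scale) depths receive finer resolution than far range. This is a minimal illustration of the general idea only; the specific spacing function, range `[d_min, d_max]`, bin count, and growth factor are assumptions for demonstration, not the paper's exact design.

```python
import numpy as np

def uneven_bev_depth_bins(d_min=1.0, d_max=60.0, n_bins=64, growth=1.05):
    """Illustrative non-uniform depth binning for a BEV grid.

    Bin widths grow geometrically by `growth` per bin, then are
    rescaled so the edges exactly span [d_min, d_max]. Near bins
    end up narrower (fine indoor resolution) and far bins wider
    (coarse outdoor resolution). All parameter values are
    hypothetical placeholders, not taken from UniMODE.
    """
    widths = growth ** np.arange(n_bins)          # geometric widths
    widths = widths / widths.sum() * (d_max - d_min)  # rescale to cover the range
    edges = d_min + np.concatenate([[0.0], np.cumsum(widths)])
    return edges

edges = uneven_bev_depth_bins()
# the first bin is narrower than the last, giving finer near-range resolution
```

Compared with a uniform grid of the same bin count, this spacing lets one grid serve both small indoor depth ranges and large outdoor ones without exploding the number of cells.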