Existing point-cloud-based 3D detectors are designed for one particular type of scene, either indoor or outdoor. Because point clouds collected in different environments differ substantially in object distribution and point density, and because 3D metrics are intricate, a unified network architecture that accommodates diverse scenes is still lacking. In this paper, we propose Uni3DETR, a unified 3D detector that addresses indoor and outdoor 3D detection within the same framework. Specifically, we employ a detection transformer with point-voxel interaction for object prediction, which leverages voxel features and points for cross-attention and is robust to discrepancies across data. We then propose a mixture of query points, which sufficiently exploits global information for dense, small-range indoor scenes and local information for sparse, large-range outdoor ones. Furthermore, our proposed decoupled IoU provides an easy-to-optimize training target for localization by disentangling the xy and z spaces. Extensive experiments validate that Uni3DETR performs consistently well on both indoor and outdoor 3D detection. In contrast to previous specialized detectors, which may perform well on particular datasets but degrade substantially on different scenes, Uni3DETR demonstrates strong generalization ability under heterogeneous conditions (Fig. 1). Code is available at \href{https://github.com/zhenyuw16/Uni3DETR}{https://github.com/zhenyuw16/Uni3DETR}.
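To make the idea of disentangling the xy and z spaces more concrete, the following is a minimal sketch rather than the paper's implementation. The abstract does not give the formula, so the sketch assumes that the decoupled term is the average of a bird's-eye-view (xy) IoU and a height (z) IoU. It also assumes axis-aligned boxes with positive sizes, whereas real detectors typically handle rotated boxes. The names `Box3D`, `iou_3d`, and `decoupled_iou` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical axis-aligned 3D box: center (x, y, z), size (dx, dy, dz)."""
    x: float
    y: float
    z: float
    dx: float
    dy: float
    dz: float


def _overlap_1d(c1: float, l1: float, c2: float, l2: float) -> float:
    """Length of the overlap between two 1D intervals given by center and length."""
    lo = max(c1 - l1 / 2, c2 - l2 / 2)
    hi = min(c1 + l1 / 2, c2 + l2 / 2)
    return max(0.0, hi - lo)


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Standard volumetric IoU for axis-aligned boxes."""
    ox = _overlap_1d(a.x, a.dx, b.x, b.dx)
    oy = _overlap_1d(a.y, a.dy, b.y, b.dy)
    oz = _overlap_1d(a.z, a.dz, b.z, b.dz)
    inter = ox * oy * oz
    union = a.dx * a.dy * a.dz + b.dx * b.dy * b.dz - inter
    return inter / union


def decoupled_iou(a: Box3D, b: Box3D) -> float:
    """Assumed form: mean of the xy-plane IoU and the z-axis IoU."""
    ox = _overlap_1d(a.x, a.dx, b.x, b.dx)
    oy = _overlap_1d(a.y, a.dy, b.y, b.dy)
    oz = _overlap_1d(a.z, a.dz, b.z, b.dz)

    # IoU of the two footprints in the xy plane (areas).
    inter_xy = ox * oy
    iou_xy = inter_xy / (a.dx * a.dy + b.dx * b.dy - inter_xy)

    # IoU of the two height intervals along z (lengths).
    iou_z = oz / (a.dz + b.dz - oz)

    return 0.5 * (iou_xy + iou_z)


if __name__ == "__main__":
    pred = Box3D(0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    gt = Box3D(0.0, 0.0, 1.0, 2.0, 2.0, 2.0)  # same footprint, shifted in z
    print(iou_3d(pred, gt), decoupled_iou(pred, gt))
```

Under these assumptions, a prediction whose footprint is already correct but whose height is off keeps its full xy term, and only the z term is penalized. In the volumetric IoU the two errors are multiplied together, so a mistake along one axis suppresses the signal from the others.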