We present 3DiffTection, a state-of-the-art method for 3D object detection from single images that leverages features from a 3D-aware diffusion model. Annotating large-scale image data for 3D detection is resource-intensive and time-consuming. Recently, large pretrained image diffusion models have become prominent as effective feature extractors for 2D perception tasks. However, these models are trained on paired text and image data, so their features are not optimized for 3D tasks and often exhibit a domain gap when applied to the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we fine-tune a diffusion model to perform novel view synthesis conditioned on a single image by introducing a novel epipolar warp operator. This task meets two essential criteria: it requires 3D awareness, and it relies solely on posed image data, which are readily available (e.g., from videos) and do not require manual annotation. For semantic refinement, we further train the model on target data with detection supervision. Both tuning phases employ ControlNet to preserve the integrity of the original feature capabilities. In the final step, we use these enhanced capabilities to conduct a test-time prediction ensemble across multiple virtual viewpoints. Through this methodology, we obtain 3D-aware features that are tailored for 3D detection and excel at identifying cross-view point correspondences. Consequently, our model is a powerful 3D detector that substantially surpasses previous benchmarks, e.g., Cube-RCNN, a prior method for single-view 3D detection, by 9.43% in AP3D on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection demonstrates robust data efficiency and generalization to cross-domain data.
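The abstract does not define the epipolar warp operator, so the following is only a minimal sketch of the standard two-view geometry that such an operator would presumably build on: given a relative camera pose, a pixel in the source view constrains its match in the target view to a line. All function names here (`inv_intrinsics`, `fundamental_matrix`, `epipolar_line`) are hypothetical, the pinhole intrinsics and the pose convention `X_tgt = R @ X_src + t` are assumptions, and nothing below is taken from the 3DiffTection implementation.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    """Return the transpose of a matrix."""
    return [list(row) for row in zip(*a)]


def skew(t: Sequence[float]) -> Matrix:
    """Cross-product matrix [t]_x, so that [t]_x v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def inv_intrinsics(fx: float, fy: float, cx: float, cy: float) -> Matrix:
    """Closed-form inverse of the pinhole matrix K = [[fx,0,cx],[0,fy,cy],[0,0,1]]."""
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def fundamental_matrix(
    k_inv_src: Matrix, k_inv_tgt: Matrix, rotation: Matrix, translation: Sequence[float]
) -> Matrix:
    """F = K_tgt^{-T} [t]_x R K_src^{-1}, assuming X_tgt = R X_src + t."""
    essential = matmul(skew(translation), rotation)
    return matmul(matmul(transpose(k_inv_tgt), essential), k_inv_src)


def epipolar_line(f: Matrix, pixel: Sequence[float]) -> List[float]:
    """Map a source pixel (u, v) to line coefficients (a, b, c) in the target
    image; any matching target pixel (u', v') satisfies a*u' + b*v' + c = 0."""
    u, v = pixel
    x = [u, v, 1.0]
    return [sum(f[i][j] * x[j] for j in range(3)) for i in range(3)]
```

For a pure sideways translation with identical intrinsics, the returned line reduces to the same image row as the source pixel, which is the familiar rectified-stereo case. A feature-level warp could then sample or aggregate source-view features along such lines, but how 3DiffTection does this is not stated in the abstract.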