Detecting objects in 3D space from monocular input is crucial for applications ranging from robotics to scene understanding. Despite strong performance in indoor and autonomous-driving domains, existing monocular 3D detection models struggle on in-the-wild images due to the scarcity of in-the-wild 3D datasets and the difficulty of 3D annotation. We introduce LabelAny3D, an \emph{analysis-by-synthesis} framework that reconstructs holistic 3D scenes from 2D images to efficiently produce high-quality 3D bounding box annotations. Building on this pipeline, we present COCO3D, a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset and covering a wide range of object categories absent from existing 3D datasets. Experiments show that annotations generated by LabelAny3D improve monocular 3D detection performance across multiple benchmarks and surpass prior auto-labeling approaches in annotation quality. These results demonstrate the promise of foundation-model-driven annotation for scaling up 3D recognition in realistic, open-world settings.