Capturing and labeling real-world 3D data is laborious and time-consuming, which makes it costly to train strong 3D models. To address this issue, previous works generate randomized 3D scenes and pre-train models on the generated data. Although the pre-trained models show promising performance gains, these works have two major shortcomings. First, they focus on only one downstream task (i.e., object detection). Second, a fair comparison of the data generated by different methods is still lacking. In this work, we systematically compare data generation methods under a unified setup. To clarify how well the pre-trained models generalize, we evaluate them on multiple tasks (e.g., object detection and semantic segmentation) and with different pre-training methods (e.g., masked autoencoders and contrastive learning). Moreover, we propose a new method that generates 3D scenes with spherical harmonics. It surpasses the previous formula-driven method by a clear margin and achieves results on par with methods that use real-world scans and CAD models.
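To make the idea of formula-driven shape generation with spherical harmonics concrete, the following is a minimal sketch and not the method proposed in this work, whose details the abstract does not give. It assumes one simple construction: a star-shaped surface whose radius is 1 plus a random combination of real spherical harmonics, sampled as a point cloud. The function names (`assoc_legendre`, `real_sph_harm`, `random_sh_shape`) and the parameters `max_degree`, `amplitude`, and `seed` are hypothetical and chosen only for illustration.

```python
import math
import random


def assoc_legendre(l, m, x):
    """Associated Legendre function P_l^m(x) for 0 <= m <= l and |x| <= 1,
    computed with the standard upward recurrence in l."""
    pmm = 1.0
    if m > 0:
        somx2 = math.sqrt((1.0 - x) * (1.0 + x))
        fact = 1.0
        for _ in range(m):
            pmm *= -fact * somx2  # includes the (-1)^m phase factor
            fact += 2.0
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    if l == m + 1:
        return pmmp1
    for ll in range(m + 2, l + 1):
        pll = ((2 * ll - 1) * x * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
        pmm, pmmp1 = pmmp1, pll
    return pmmp1


def real_sph_harm(l, m, theta, phi):
    """Real spherical harmonic of degree l and order m (-l <= m <= l).
    theta is the polar angle in [0, pi], phi the azimuth in [0, 2*pi).
    The overall sign convention is immaterial for random coefficients."""
    am = abs(m)
    norm = math.sqrt((2 * l + 1) / (4 * math.pi)
                     * math.factorial(l - am) / math.factorial(l + am))
    p = assoc_legendre(l, am, math.cos(theta))
    if m > 0:
        return math.sqrt(2.0) * norm * p * math.cos(m * phi)
    if m < 0:
        return math.sqrt(2.0) * norm * p * math.sin(am * phi)
    return norm * p


def random_sh_shape(n_points=1024, max_degree=4, amplitude=0.3, seed=0):
    """Sample a point cloud from the surface r(theta, phi) = 1 + sum c_lm Y_lm.
    Assumption for this sketch: coefficients are Gaussian and damped with degree."""
    rng = random.Random(seed)
    coeffs = {(l, m): rng.gauss(0.0, amplitude / (l + 1))
              for l in range(1, max_degree + 1)
              for m in range(-l, l + 1)}
    points = []
    for _ in range(n_points):
        # Uniform directions on the sphere: cos(theta) uniform in (-1, 1].
        theta = math.acos(1.0 - 2.0 * rng.random())
        phi = 2.0 * math.pi * rng.random()
        r = 1.0 + sum(c * real_sph_harm(l, m, theta, phi)
                      for (l, m), c in coeffs.items())
        r = max(r, 0.05)  # keep the radius positive so the surface stays valid
        points.append((r * math.sin(theta) * math.cos(phi),
                       r * math.sin(theta) * math.sin(phi),
                       r * math.cos(theta)))
    return points


if __name__ == "__main__":
    cloud = random_sh_shape(n_points=8, seed=42)
    for x, y, z in cloud:
        print(f"{x:+.3f} {y:+.3f} {z:+.3f}")
```

Changing `seed` gives a different shape, and raising `max_degree` adds finer surface detail. Composing several such shapes into a scene (placement, scale, labels) is a separate step that this sketch does not attempt.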