Scaling up representations for images and text has been studied extensively in the past few years and has driven major advances in vision and language learning. Scalable representations for 3D objects and scenes, however, remain relatively unexplored. In this work, we present Uni3D, a 3D foundation model for exploring unified 3D representations at scale. Uni3D uses a ViT initialized from 2D pretrained weights and pretrains it end to end to align 3D point cloud features with image-text aligned features. With this simple architecture and pretext task, Uni3D can use abundant 2D pretrained models as initialization and image-text aligned models as targets, bringing the potential of 2D models and scaling-up strategies to the 3D world. We efficiently scale Uni3D up to one billion parameters and set new records on a broad range of 3D tasks, including zero-shot classification, few-shot classification, open-world understanding, and part segmentation. We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild. We believe that Uni3D provides a new direction for exploring both the scaling up and the efficiency of representations in the 3D domain.
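The alignment objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, so the symmetric contrastive (InfoNCE-style) form below is an assumption, as are the function names, the temperature value, and the use of random arrays in place of real encoder outputs. The sketch only shows how point cloud embeddings could be pulled toward fixed image and text target embeddings; it is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(a, b, temperature=0.07):
    # Row i of `a` and row i of `b` form a positive pair;
    # every other row in the batch serves as a negative.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def alignment_loss(point_emb, image_emb, text_emb, temperature=0.07):
    # Hypothetical combined objective: align the 3D embedding with both
    # the image target and the text target (assumed to come from a frozen
    # image-text aligned model).
    return (symmetric_contrastive_loss(point_emb, image_emb, temperature)
            + symmetric_contrastive_loss(point_emb, text_emb, temperature))

# Toy data standing in for encoder outputs: 8 objects, 16-dim embeddings.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))
aligned = alignment_loss(target, target, target)
shuffled = alignment_loss(rng.normal(size=(8, 16)), target, target)
```

In this toy setup, point embeddings that already match their targets give a lower loss than unrelated random embeddings, which is the behavior an alignment objective of this kind is meant to reward.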