Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and for robots navigating and interacting within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we provide a detailed ``cookbook'' outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects and incorporate them into our structured 3D scene modality. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, both synthetic and real-world. We demonstrate our model's effectiveness in reconstructing complete 3D scenes of complex objects from a single image and in real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
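To make the scene-tokenization idea concrete, here is a minimal Python sketch of one way a structured 3D scene could be serialized into a discrete token sequence for an autoregressive model. The object fields, quantization bins, and structural token vocabulary (`<scene>`, `<obj>`, `<cat:...>`, etc.) are illustrative assumptions, not the paper's actual scheme.

```python
# A minimal sketch (not the paper's implementation) of serializing a
# structured 3D scene into discrete tokens for autoregressive modeling.
# All field names, ranges, and bin counts below are assumptions.

from dataclasses import dataclass

@dataclass
class SceneObject:
    category: str                          # e.g. "chair" (hypothetical label set)
    position: tuple[float, float, float]   # world coordinates, assumed in [-1, 1]
    size: float                            # normalized scale, assumed in [0, 1]

def quantize(value: float, lo: float, hi: float, bins: int = 256) -> int:
    """Map a continuous value to one of `bins` discrete token ids."""
    t = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    return min(int(t * bins), bins - 1)

def tokenize_scene(objects: list[SceneObject]) -> list[str]:
    """Flatten a scene into a token sequence: one block per object,
    delimited by structural tokens so a decoder can parse it back."""
    tokens = ["<scene>"]
    for obj in objects:
        tokens.append("<obj>")
        tokens.append(f"<cat:{obj.category}>")
        for axis, coord in zip("xyz", obj.position):
            tokens.append(f"<{axis}:{quantize(coord, -1.0, 1.0)}>")
        tokens.append(f"<size:{quantize(obj.size, 0.0, 1.0)}>")
        tokens.append("</obj>")
    tokens.append("</scene>")
    return tokens

if __name__ == "__main__":
    scene = [SceneObject("chair", (0.1, -0.3, 0.5), 0.4),
             SceneObject("table", (-0.6, 0.0, 0.2), 0.7)]
    print(tokenize_scene(scene))
```

Such a flat, delimited sequence can then be interleaved with text and image tokens in a single autoregressive stream, which is one plausible way to align the three modalities the abstract describes.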