Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence: a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes, to rigging characters for animation, to producing programmatic scripts that describe object behaviors. We discuss three key design requirements for a 3D foundation model of this kind and then present our first step towards building it. We expect that 3D geometric shapes will be a core data type, and we describe our solution for a 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.