What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? Existing systems, trained on rigid and limited indoor datasets, cannot translate such abstract scene descriptions into stylized scene elements with any generality. In this paper, we propose to leverage the knowledge captured by foundation models to accomplish this translation. We present a system that serves as a tool for generating stylized assets for 3D scenes described by a short phrase, without the need to enumerate the objects in the scene or give instructions on their appearance. It is also robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist. Our system achieves this using a foundation model "team" composed of a large language model, a vision-language model, and several image diffusion models, which communicate through an interpretable and user-editable intermediate representation, allowing more versatile and controllable stylized asset generation. We introduce novel metrics for this task and show through human evaluations that, in 91% of cases, our system's outputs are judged more faithful to the semantics of the input scene description than those of the baseline, highlighting the potential of this approach to radically accelerate 3D content creation for 3D artists.