Generating 3D visual scenes is at the forefront of visual generative AI, but current 3D generation techniques struggle to generate scenes that contain multiple high-resolution objects. Here we introduce Lay-A-Scene, which addresses the task of Open-set 3D Object Arrangement: given a set of 3D objects, possibly unseen during training, find a plausible arrangement of these objects in a scene. We address this task by leveraging pre-trained text-to-image models. We personalize the model and explain how to generate images of a scene that contains multiple predefined objects without neglecting any of them. We then describe how to infer the 3D poses and arrangement of the objects from a generated 2D image by finding a consistent projection of the objects onto the 2D scene. We evaluate Lay-A-Scene on 3D objects from Objaverse with human raters and find that it often generates coherent and feasible 3D object arrangements.
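To make the idea of "finding a consistent projection" concrete, the following is a minimal sketch of a generic camera-fitting step. It is not the Lay-A-Scene procedure, whose details the abstract does not give. It assumes that 3D-to-2D point correspondences are already known and uses the standard Direct Linear Transform (DLT) to recover a 3x4 projection matrix. The function names `project` and `estimate_projection_dlt`, and all camera values in the synthetic check, are hypothetical and chosen only for illustration.

```python
import numpy as np


def project(P, X):
    """Project N x 3 points X with a 3 x 4 matrix P; returns N x 2 image coordinates."""
    Xh = np.hstack([X, np.ones((len(X), 1))])  # homogeneous 3D points
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]  # perspective divide


def estimate_projection_dlt(X, x):
    """Estimate a 3 x 4 projection matrix (up to scale) from at least six
    non-degenerate 3D-2D correspondences using the Direct Linear Transform."""
    rows = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        p = [Xw, Yw, Zw, 1.0]
        # From u = (p1 . X) / (p3 . X) and v = (p2 . X) / (p3 . X)
        rows.append([*p, 0, 0, 0, 0, *(-u * c for c in p)])
        rows.append([0, 0, 0, 0, *p, *(-v * c for c in p)])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)


# Synthetic check with an assumed (hypothetical) camera.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(12, 3))  # points on a stand-in "object"
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])  # assumed intrinsics
theta = 0.3  # rotation about the y axis, in radians
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([[0.1], [-0.2], [5.0]])  # places the object in front of the camera
P_true = K @ np.hstack([R, t])

x = project(P_true, X)  # "observed" 2D locations
P_est = estimate_projection_dlt(X, x)
err = np.abs(project(P_est, X) - x).max()  # maximum reprojection error in pixels
print("max reprojection error (pixels):", err)
```

In a real pipeline the correspondences would have to come from matching the known 3D object against its appearance in the generated image, and a robust solver that tolerates outliers would normally replace the plain least-squares step shown here.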