Indoor scene synthesis involves automatically selecting furniture and placing it appropriately on a floor plan, so that the resulting scene looks realistic and is functionally plausible. Such scenes can serve as homes for immersive 3D experiences or be used to train embodied agents. Existing methods for this task rely on labeled furniture categories, e.g. bed, chair, or table, to generate contextually relevant combinations of furniture. Whether heuristic or learned, these methods ignore the instance-level visual attributes of objects and, as a result, may produce less visually coherent scenes. In this paper, we introduce an auto-regressive scene model that outputs instance-level predictions using a general-purpose image embedding based on CLIP. This allows us to learn visual correspondences, such as matching color and style, and to produce more functionally plausible and aesthetically pleasing scenes. Evaluated on the 3D-FRONT dataset, our model achieves state-of-the-art results in scene synthesis and improves auto-completion metrics by over 50%. Moreover, our embedding-based approach enables zero-shot text-guided scene synthesis and editing, which generalizes easily to furniture not seen during training.
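As an illustration of what an instance-level prediction in an embedding space could look like in practice, the following is a minimal sketch, not the paper's implementation. It assumes the model emits an embedding vector for the next object and that every catalog item has a precomputed image embedding in the same space; the object is then chosen by cosine similarity. The catalog, the 3-dimensional vectors, and the function names are hypothetical (real CLIP embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_instance(query, catalog):
    """Return the name of the catalog item whose embedding is closest to `query`."""
    if not catalog:
        raise ValueError("catalog is empty")
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))


# Hypothetical toy catalog: item name -> image embedding (illustrative values only).
CATALOG = {
    "oak_bed": [0.9, 0.1, 0.0],
    "white_chair": [0.1, 0.9, 0.1],
    "oak_table": [0.8, 0.2, 0.3],
}

# Hypothetical embedding predicted by a scene model for the next object.
predicted = [0.85, 0.15, 0.25]
best = retrieve_instance(predicted, CATALOG)
print(best)
```

Because CLIP maps text and images into a shared embedding space, the same lookup could in principle take a text embedding as the query, which is one way to picture the zero-shot text-guided setting. Furniture added to the catalog after training only needs an embedding to become retrievable.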