Current methods for generating 3D scene layouts from text predominantly follow a declarative paradigm, in which a Large Language Model (LLM) specifies high-level constraints that a separate solver then resolves. This paper challenges that consensus by introducing a more direct, imperative approach. We task an LLM with generating a step-by-step program that iteratively places each object relative to those already in the scene. This paradigm simplifies the underlying scene specification language and enables more complex, varied, and highly structured layouts that are difficult to express declaratively. To improve robustness, we complement our method with a novel, LLM-free error correction mechanism that operates directly on the generated code, iteratively adjusting parameters within the program to resolve collisions and other inconsistencies. In forced-choice perceptual studies, human participants preferred our imperative layouts over those from two state-of-the-art declarative systems 82% and 94% of the time, respectively, demonstrating the significant potential of this alternative paradigm. Finally, we present a simple automated evaluation metric for 3D scene layout generation that correlates strongly with human judgment.
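To make the imperative paradigm and the LLM-free correction step concrete, the following is a minimal sketch and not the paper's implementation. Everything in it is an assumption made for illustration: the 2D axis-aligned `Box` footprint, the `place_right_of` primitive, the hand-written `layout_program` standing in for LLM output, and the greedy coordinate search in `repair`, which nudges the program's numeric parameters until the total overlap area reaches zero.

```python
from dataclasses import dataclass

EPS = 1e-9  # tolerance so that boxes which merely touch do not count as colliding


@dataclass
class Box:
    """Axis-aligned footprint on the floor plane (hypothetical representation)."""
    x: float  # centre along x
    y: float  # centre along y
    w: float  # extent along x
    d: float  # extent along y

    def overlap_area(self, other: "Box") -> float:
        ox = (self.w + other.w) / 2 - abs(self.x - other.x)
        oy = (self.d + other.d) / 2 - abs(self.y - other.y)
        return ox * oy if ox > EPS and oy > EPS else 0.0


def place_right_of(scene, name, w, d, anchor, gap):
    """Imperative step: place a new box to the right of an existing one."""
    a = scene[anchor]
    scene[name] = Box(a.x + (a.w + w) / 2 + gap, a.y, w, d)


def layout_program(params):
    """Stand-in for an LLM-written program; `params` holds its tunable numbers."""
    scene = {"bed": Box(0.0, 0.0, 2.0, 1.6)}
    place_right_of(scene, "nightstand", 0.5, 0.5, "bed", params["gap_nightstand"])
    place_right_of(scene, "lamp", 0.4, 0.4, "bed", params["gap_lamp"])
    return scene


def penetration(scene):
    """Total overlap area over all pairs of boxes; 0.0 means collision-free."""
    boxes = list(scene.values())
    return sum(a.overlap_area(b)
               for i, a in enumerate(boxes) for b in boxes[i + 1:])


def repair(program, params, step=0.1, max_iters=200):
    """Greedy search over program parameters, with no LLM in the loop.

    Each iteration tries +/- step on every parameter and keeps the single
    change that lowers the overlap most; it stops at zero overlap or when
    no single change helps (a local minimum).
    """
    params = dict(params)
    cost = penetration(program(params))
    for _ in range(max_iters):
        if cost == 0.0:
            break
        best = None
        for key in params:
            for delta in (step, -step):
                trial = {**params, key: params[key] + delta}
                trial_cost = penetration(program(trial))
                if trial_cost < cost and (best is None or trial_cost < best[0]):
                    best = (trial_cost, trial)
        if best is None:
            break  # stuck; a real system would need a fallback strategy
        cost, params = best
    return params, cost


# Both objects are placed flush against the bed, so they collide with each other.
INITIAL = {"gap_nightstand": 0.0, "gap_lamp": 0.0}

if __name__ == "__main__":
    fixed, remaining = repair(layout_program, INITIAL)
    print(fixed, remaining)
```

In this sketch the program itself is never rewritten: only its numeric arguments change, which is one plausible reading of "adjusting parameters within the program". The search is deliberately simple and can stall in a local minimum, so it should be read as an illustration of the idea rather than a robust solver.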