Learning to Place Objects with Programs and Iterative Self Training

In this work we study indoor scene object placement: given a 3D indoor scene and an object, the task is to predict placement locations within the scene. Empirically, data-driven approaches to this problem tend to miss placement modes. We introduce a system that helps address this flaw. We design a Domain Specific Language (DSL) that specifies relational constraints between objects. When executed, programs in this language predict possible placements given a partial scene and an object. We design a generative model that writes these programs automatically. Available 3D scene datasets do not contain programs to train on, and naively extracted programs predict only the original placement location of each scene object. Training on these programs results in subpar performance, so we introduce a new program bootstrapping algorithm that improves our system's performance over the naive approach. To quantify our qualitative observations, we introduce a new evaluation procedure that captures how well a system models per-object location distributions: we ask human annotators to label all the possible places an object can go in a scene and compare this set against the locations produced by the system under evaluation. Our system produces per-object location distributions more consistent with human annotators than those produced by existing data-driven approaches and by a zero-shot approach using an LLM. When training data is sparse, our system also degrades less than the other systems.
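To make the idea of executing a relational-constraint program concrete, the following is a minimal sketch and not the paper's actual DSL. It assumes a top-down grid view of the room, and the names `Scene`, `adjacent_to`, `against_wall`, and `execute` are hypothetical, introduced only for illustration. A program here is simply a list of constraints, and executing it keeps every free cell that satisfies all of them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Cell = Tuple[int, int]            # (row, col) on a top-down floor grid
Box = Tuple[int, int, int, int]   # (row_min, col_min, row_max, col_max), inclusive


@dataclass
class Scene:
    """Hypothetical partial scene: a grid plus the boxes of placed objects."""
    rows: int
    cols: int
    objects: Dict[str, Box]


# A constraint maps a scene to the set of cells that satisfy it.
Constraint = Callable[[Scene], Set[Cell]]


def free_cells(scene: Scene) -> Set[Cell]:
    """Cells not covered by any existing object."""
    occupied = {
        (r, c)
        for (r0, c0, r1, c1) in scene.objects.values()
        for r in range(r0, r1 + 1)
        for c in range(c0, c1 + 1)
    }
    every = {(r, c) for r in range(scene.rows) for c in range(scene.cols)}
    return every - occupied


def adjacent_to(name: str) -> Constraint:
    """Assumed constraint: cells within one grid step of the named object's box."""
    def check(scene: Scene) -> Set[Cell]:
        r0, c0, r1, c1 = scene.objects[name]
        return {
            (r, c)
            for r in range(max(r0 - 1, 0), min(r1 + 1, scene.rows - 1) + 1)
            for c in range(max(c0 - 1, 0), min(c1 + 1, scene.cols - 1) + 1)
        }
    return check


def against_wall() -> Constraint:
    """Assumed constraint: cells on the border of the room."""
    def check(scene: Scene) -> Set[Cell]:
        return {
            (r, c)
            for r in range(scene.rows)
            for c in range(scene.cols)
            if r in (0, scene.rows - 1) or c in (0, scene.cols - 1)
        }
    return check


def execute(program: List[Constraint], scene: Scene) -> Set[Cell]:
    """Run a program: keep the free cells that satisfy every constraint."""
    result = free_cells(scene)
    for constraint in program:
        result &= constraint(scene)
    return result


# A 4x4 room with a bed in one corner; place a nightstand next to it, on a wall.
scene = Scene(rows=4, cols=4, objects={"bed": (0, 0, 1, 1)})
program = [adjacent_to("bed"), against_wall()]
placements = execute(program, scene)
print(sorted(placements))  # [(0, 2), (2, 0)]
```

In this toy example a single program yields two separate placement locations, one on each side of the bed, which is the kind of multi-modal output the abstract says data-driven approaches tend to miss.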
