Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI, especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods focus mainly on large-scale scenes and struggle to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen takes a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference image to obtain per-instance images. Each instance is reconstructed into a 3D model and aligned to a canonical coordinate frame. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from a 2D reference image. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, and that it generates realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.
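The two-stage decoupling of rotation from translation and scale can be illustrated with a toy 2D sketch. This is not TabletopGen's implementation, which the abstract does not specify. The function names `fit_yaw` and `top_view_align`, the restriction to a single yaw angle, the known point correspondences, the axis-aligned top-view box, and the `meters_per_px` calibration are all assumptions made for illustration.

```python
import math


def fit_yaw(src, dst, steps=500, lr=0.1):
    """Stage 1 (toy): recover a yaw angle by gradient descent.

    Minimizes the mean squared distance between rotated source points
    and target points. Assumes known 2D correspondences, which a real
    system would not have. The loss is non-convex in theta, so a start
    that is close to 180 degrees off can converge very slowly.
    """
    theta = 0.0
    n = len(src)
    for _ in range(steps):
        c, s = math.cos(theta), math.sin(theta)
        grad = 0.0
        for (px, py), (qx, qy) in zip(src, dst):
            rx, ry = c * px - s * py, s * px + c * py  # rotated point
            # d(rx)/d(theta) = -ry and d(ry)/d(theta) = rx
            grad += 2.0 * ((rx - qx) * (-ry) + (ry - qy) * rx)
        theta -= lr * grad / n
    return theta


def top_view_align(box_px, model_extent, meters_per_px):
    """Stage 2 (toy): scale and planar translation from a top-view box.

    box_px:        (x_min, y_min, x_max, y_max) of the instance in a
                   hypothetical top-view image, in pixels.
    model_extent:  (width, depth) of the already-rotated model footprint
                   in its own units.
    meters_per_px: assumed metric calibration of the top view.
    Returns a uniform scale factor and the (x, y) center in meters.
    """
    x0, y0, x1, y1 = box_px
    width_m = (x1 - x0) * meters_per_px
    depth_m = (y1 - y0) * meters_per_px
    # Average the two per-axis ratios to get one uniform scale.
    scale = 0.5 * (width_m / model_extent[0] + depth_m / model_extent[1])
    center = (0.5 * (x0 + x1) * meters_per_px,
              0.5 * (y0 + y1) * meters_per_px)
    return scale, center


def rotate(points, theta):
    """Rotate 2D points counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


if __name__ == "__main__":
    footprint = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]
    observed = rotate(footprint, 0.7)  # synthetic ground-truth yaw
    yaw = fit_yaw(footprint, observed)
    scale, center = top_view_align((100, 200, 300, 300), (1.0, 0.5), 0.001)
    print(f"yaw={yaw:.4f} rad, scale={scale:.3f}, center={center}")
```

Solving rotation first means the second stage can treat each model's footprint as axis-aligned in the top view, which reduces scale and translation to simple box arithmetic in this sketch.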