Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
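The render-generate-lift principle described above can be pictured as a simple staged loop. The following is only a minimal sketch of that control flow, not the authors' implementation: the function `render_generate_lift`, the `Sample` container, and the four stage callables (`render`, `propose_actions`, `insert_human`, `lift_to_3d`) are hypothetical names introduced for illustration, and the real stages (renderer, vision-language model, image-editing model, SMPL-X optimization) are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Sample:
    """One generated interaction (hypothetical container, not from the paper)."""
    scene_id: str
    action: str
    body_params: Any  # stands in for fitted SMPL-X parameters


def render_generate_lift(
    scene_id: str,
    render: Callable[[str], Any],
    propose_actions: Callable[[Any], List[str]],
    insert_human: Callable[[Any, str], Any],
    lift_to_3d: Callable[[Any, str], Any],
) -> List[Sample]:
    """Run the three stages in order for a single scene view.

    All four callables are assumptions: they stand in for the renderer,
    the vision-language model, the image-editing model, and the
    optimization that fits a body to the scene geometry.
    """
    image = render(scene_id)                    # render: 3D scene -> image
    samples = []
    for action in propose_actions(image):       # generate: propose actions
        edited = insert_human(image, action)    # generate: insert a human
        body = lift_to_3d(edited, scene_id)     # lift: edited image -> 3D body
        samples.append(Sample(scene_id, action, body))
    return samples


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any models.
    demo = render_generate_lift(
        "scene_0",
        render=lambda s: f"img:{s}",
        propose_actions=lambda img: ["sit on sofa"],
        insert_human=lambda img, a: f"{img}+{a}",
        lift_to_3d=lambda edited, s: {"source": edited},
    )
    print(demo)
```

Passing the stages in as callables keeps the sketch independent of any particular model or renderer, which are not specified at this level of detail in the abstract.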