The advancement of Embodied AI relies heavily on large-scale, simulatable 3D scene datasets with diverse scenes and realistic layouts. However, existing datasets typically suffer from limited scale or diversity, sanitized layouts that lack small items, and severe object collisions. To address these shortcomings, we introduce \textbf{InternScenes}, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes drawn from three disparate sources: real-world scans, procedurally generated scenes, and designer-created scenes. It includes 1.96M 3D objects and covers 15 common scene types and 288 object classes. We deliberately preserve the large number of small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas of real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions through physical simulation. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both reveal new challenges posed by complex and realistic layouts. More importantly, InternScenes paves the way for scaling up model training for both tasks, making generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.