3D Gaussian Splatting (3DGS), a 3D representation method capable of photorealistic real-time rendering, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks the fine-grained semantics and physical executability required for Visual-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds fine-grained object-level annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, a dataset of 1K object-annotated 3DGS indoor scenes, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark, with 2M VLN data samples. Experiments show that training on 3DGS scene data is harder to converge but generalizes strongly, improving baseline performance by 31% on the VLN-CE Unseen task. The data and code will be released soon.