The Experience Goal Visual Rearrangement task is a foundational challenge in Embodied AI: the agent must construct a robust world model that accurately captures the goal state, then use that model to restore a shuffled scene to its original configuration. An accurate representation of the world is therefore essential to completing the task. In this work, we present a novel framework that uses 3D Gaussian Splatting as the 3D scene representation for the experience goal visual rearrangement task. Recent advances in volumetric scene representation, such as 3D Gaussian Splatting, offer fast rendering of high-quality, photo-realistic novel views. Our approach gives the agent consistent views of the current and goal settings of the rearrangement task, so that it can compare the goal state and the shuffled state of the world directly in image space. To compare these views, we propose a dense feature matching method that uses visual features extracted from a foundation model, whose more universal feature representation improves robustness and generalization. We validate our approach on the AI2-THOR rearrangement challenge benchmark and demonstrate improvements over current state-of-the-art methods.
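The image-space comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the goal and current views are rendered from the same camera pose, so that dense matching reduces to comparing the feature vectors at the same location, and it assumes the per-location features have already been extracted by some foundation model. The function names, the toy feature values, and the threshold of 0.8 are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as dissimilar
    return dot / (norm_a * norm_b)


def change_mask(goal_feats, current_feats, threshold=0.8):
    """Flag locations whose features differ between goal and current views.

    Both inputs are H x W grids (nested lists) of feature vectors.
    Assumption: the two views are pixel-aligned, so location (i, j) in one
    view corresponds to location (i, j) in the other.
    """
    if len(goal_feats) != len(current_feats):
        raise ValueError("feature maps must have the same height")
    mask = []
    for goal_row, cur_row in zip(goal_feats, current_feats):
        if len(goal_row) != len(cur_row):
            raise ValueError("feature maps must have the same width")
        # A location counts as changed when similarity drops below threshold.
        mask.append([cosine_similarity(g, c) < threshold
                     for g, c in zip(goal_row, cur_row)])
    return mask


# Toy 2 x 2 feature maps with 3-dimensional features (hypothetical values).
goal = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
current = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
           [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]]

mask = change_mask(goal, current)
print(mask)
```

In this toy input only the bottom-left location differs between the two views, so it is the only one flagged. A real pipeline would likely need to tolerate rendering noise and small misalignments, for example by matching within a local neighbourhood rather than at a single location.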